Open joelostblom opened 1 year ago
I really like the idea of cleaning up the top-level API! Thanks for bringing this up. I also find it quite overwhelming to go through the API references in the docs and to use autocompletion if you don't specifically know what you are looking for (and therefore I never use completion on alt.Chart
to browse options).
In favour of cleaning up the API references in the docs which we can maybe do without any code changes except to generate_api_docs.py
. I can't quite follow your example of refactoring FillDatum
without breaking old code, could you expand on this a bit? Would you rename that class or just move it to a new submodule and still import it in the top-level?
Exclusion from tab completion is a bit tricky. As we noticed in #2814, hiding in Jupyter works great but we haven't yet found a way to do the same in VS Code. Of course still worth the effort to adjust __all__
if at least Jupyter users can profit from it.
Would you rename that class or just move it to a new submodule and still import it in the top-level?
My hope was to move it to a submodule and import it at top level, but hide it from the top level tab completion and only allow tab completion via the submodule. I think this would allow for old code to work, while also encouraging people to use the submodules from now on as well as reduce the top-level tab completion.
Exclusion from tab completion is a bit tricky. As we noticed in https://github.com/altair-viz/altair/pull/2814, hiding in Jupyter works great but we haven't yet found a way to do the same in VS Code.
I think it would be great to support VS Code too, but if there isn't a way that pylance offers for hiding individual classes/functions, I am not sure there is much we can do until they add that. I tried asking in the discussion you linked before, let's see if there is a way.
Based on the helpful reply in https://github.com/microsoft/pylance-release/discussions/3709#discussioncomment-5093960 it seems like we need to get https://github.com/altair-viz/altair/pull/2614 or similar merged, but maybe it is worthwhile if it resolves some of the VS Code issues we have been seeing and makes it easier to maintain both VS Code and JupyterLab support?
Some notes from an offline discussion with @mattijn and @arvind:
What is very unlikely to break users code but would improve the situation a lot is removing classes from the top-level namespace which are never used by users anyway as they just represent a primitive datatype such as int
, float
. Examples: ConditionalAxisPropertyTextBaselinenull
, DictSelectionInit
, ... There are hundreds of these.
A first step could be to analyze which class names only appear ones in our code base, i.e. are defined but never used anywhere else, not even in a type hint. This can be a manual list and with every VL release, we could check in the PR if new entries get added to the top-level, if we really want them there.
Maybe we can remove these classes completely instead of just not making them available in the top-level namespace. This would make the Altair wheels even smaller. It might require a bit of a rewrite of from_dict
.
Due to the many objects, Altair takes up 23MB of memory (if I measure this correctly?) and this excludes it's requirements which are small:
from memory_profiler import profile
@profile
def import_packages():
import typing_extensions
import jinja2
import jsonschema
import packaging
import narwhals
import altair
if __name__ == "__main__":
import_packages()
Some notes from an offline discussion with @mattijn and @arvind:
There is potential in Vega-Lite to simplify the types, i.e. reduce the number of types. This would automatically improve the situation here. This can happen in VL as a first step of a potentially bigger rewrite/refactoring. However, this first requires the appropriate funding and resources so it's not around the corner.
Exciting to hear this being discussed on the Vega-Lite
-side!
@binste is there any chance this relates to better support for generic types in more recent drafts of JSON Schema
?
A while back I tried exploring what we could do in altair
to reduce the number of symbols, but it started getting tricky due to https://github.com/vega/vega-lite/blob/main/scripts/rename-schema.sh
If there is a v6
on the table, I do think it would be worth reviewing how we can better utilise typing
features - that wouldn't have been available in earlier major versions of altair
If it makes it easier for us to generate type hints in Python, we should definitely look at it once we get to that rewrite, thanks for flagging!
In general, I think their is some consensus around simplifying the type hierarchy in VL. At least for Altair, I don't think we need ConditionalParameter
, LogicalAnd
, etc. as no user will use this. Or, in case we need it for schema generation, we would burry them deep in the codebase as an implementation detail.
When we look into this, let's also review SPEC 1 which has recommendations on how to do lazy-loading submodules and functions.
When we look into this, let's also review SPEC 1 which has recommendations on how to do lazy-loading submodules and functions.
Will find the link later, but polars
has a good solution for this which preserves typing
In terms of LOC, not much more than utils._importers and is doing a whole lot more:
I haven't seen this mentioned anywhere.
We also have several modules available at multiple levels, which is making it complicated to resolve https://github.com/vega/altair/issues/3610
I discovered this yesterday and have written a summary of my findings here:
@joelostblom @binste https://github.com/vega/altair/issues/2918#issuecomment-1441570706
I think I can see a path forward here.
Building on some of the ideas from https://github.com/vega/altair/issues/3366#issuecomment-2390928260 and #3618-theme.py
alt.value
alt.AngleValue
-> alt.value.Angle
This change wouldn't cause issues with existing code, as alt.value
is currently a function.
The actual implementation could look something like:
```py from __future__ import annotations from typing import Any import altair as alt class _ValueMeta(type): """Metaclass for :class:`value`.""" @property def Angle(cls) -> type[alt.AngleValue]: return alt.AngleValue @property def Color(cls) -> type[alt.ColorValue]: return alt.ColorValue # ... for all the other `...Value` channels class value(metaclass=_ValueMeta): def __new__(cls, value: Any, **kwds: Any) -> dict[str, Any]: """Specify a value for use in an encoding.""" return dict(value=value, **kwds) ```
```py examples = ( value(1), value(1, offset=-5), value.Angle(0), value.Color("red"), value.Angle(0.5).condition(), value.Color("steelblue").condition(), ) >>> examples ({'value': 1}, {'value': 1, 'offset': -5}, AngleValue({ value: 0 }), ColorValue({ value: 'red' }), AngleValue({ condition: {}, value: 0.5 }), ColorValue({ condition: {}, value: 'steelblue' })) ```
![Image](https://github.com/user-attachments/assets/8b15d5df-b64c-4654-b177-34c5852957a3)
So this change alone is just a very lightweight rewrapping of the existing API.
A v6
version might want to look at changing the wrapped class names and/or reprs.
alt.datum
As mentioned by @joelostblom
- Moved the top level classes to methods on the API function: instead of
AngleDatum
, we would havealt.datum.Angle
.Maybe the second is lightly more natural in syntax, but it will probably make it confusing since you can use
datum.<colum_name>
in expression strings, so the first would be better for that reason.
I originally mistook this for talking about a naive approach at supporting both.
For reference, that would look like:
```py from __future__ import annotations from typing import Any import altair as alt from altair.expr.core import GetAttrExpression, GetItemExpression class _DatumMeta(type): """Metaclass for :class:`datum`.""" @property def Angle(cls) -> type[alt.AngleDatum]: return alt.AngleDatum @property def Color(cls) -> type[alt.ColorDatum]: return alt.ColorDatum def __repr__(self) -> str: return "datum" def __getattr__(self, attr) -> GetAttrExpression: if attr.startswith("__") and attr.endswith("__"): raise AttributeError(attr) return GetAttrExpression("datum", attr) def __getitem__(self, attr) -> GetItemExpression: return GetItemExpression("datum", attr) class datum(metaclass=_DatumMeta): def __new__(cls, datum: Any, **kwds: Any) -> dict[str, Any]: """Specify a datum for use in an encoding.""" return dict(datum=datum, **kwds) ```
```py examples_datum = ( datum, datum(1), datum(1, type="quantitative"), datum.Angle(1), datum.Color("green"), datum.Angle(0.2).bandPosition(0.8), datum.column_name, datum["column with spaces"], ) >>> examples_datum (datum, {'datum': 1}, {'datum': 1, 'type': 'quantitative'}, AngleDatum({ datum: 1 }), ColorDatum({ datum: 'green' }), AngleDatum({ bandPosition: 0.8, datum: 0.2 }), datum.column_name, datum['column with spaces']) ```
![Image](https://github.com/user-attachments/assets/93e99098-7b74-4fc7-ae08-2b3821d6c19a)
What if you have a column named "Color"
, or any of the other channel names?
```py
v6_color, v5_color = datum.Color, alt.datum.Color
>>> print(
f"v6 returns a class: {v6_color.__name__!r}\n "
f"value: {v6_color!r}\n"
f"v5 returns an instance of: {type(v5_color).__name__!r}\n "
f"value: {v5_color!r}\n"
)
v6 returns a class: 'ColorDatum'
value:
v5
```py import polars as pl df = pl.DataFrame( { "x": range(8), "y": [2, 5, 4, 7, 8, 3, 4, 5], "Color": [ "red", "blue", "green", "steelblue", "olive", "yellow", "red", "green", ], } ) base = alt.Chart(df).mark_bar().encode(x="x:N", y="y") v5_predicate = alt.datum.Color == "red" v5_chart = base.encode( color=alt.when(v5_predicate) .then(alt.value("black")) .otherwise("Color", legend=None) ) ```
v6
```py >>> base.encode( color=alt.when(datum.Color == "red") .then(alt.value("black")) .otherwise("Color", legend=None) ) TypeError: Expected a predicate, but got: 'bool' From `predicate=False`. ```
v6
```py >>> base.encode( color=alt.when(datum["Color"] == "red") .then(alt.value("black")) .otherwise("Color", legend=None) ) ``` ![Image](https://github.com/user-attachments/assets/4317e1d9-0c0c-43cf-9007-8f5e117e54d6)
alt.datum.___
will mean different things, depending on if the name you use collides.datum[...]
to avoid the collision.alt.datum.__getattr__(...)
for this purpose
alt.datum.__getitem__(...)
.
Additionally:
python
keywordsalt.datum
.alt.datum.___
GetAttrExpression
in all other casesExpression
is currently expectedalt.___Datum
classes
OperatorMixin
, but these methods would be available on the classOf these options I strongly prefer [2]. Despite any potential disruption, it makes things less ambiguous and requires the least explanation to new users.
I think [1] would be short-sighted and a point of confusion for even trivial cases like: alt.datum.color
vs alt.datum.Color
.
Both [3] and [4] would introduce more complexity internally - even if they solved the issue of not breaking existing code.
I know there is a lot go through here, so no rush on a response 😅
Hope the examples are helpful. Even if we don't move forward with this idea; maybe they spark an idea for someone else.
Currently, the top level of the Altair API is huge, there are 573 entries in the tab completion menu of
alt.<TAB>
. This can make it rather hard to find what you're looking for unless you know the first few character of the name already, and exploring the API from scratch as a new user must be intimidating. Many of these seem unhelpful to have exposed at the top-level, e.g. there are 70 different entries that relate to conditions, but I only usealt.condition
myself. I suggest that the top level API only includes the most relevant classes and functions that we think are helpful on a day to day basis, and the rest go under sub-modules.The Altair API consists of four parts:
I have listed my comments below on how we could reduce these, and I would love to hear what you think @mattijn, @ChristopherDavisUCI, @binste (and anyone else that wants to chime in)
API functions
Introduced specifically by Altair as conveniences and we should keep all of them (although maybe
check_fields_and_encodings
is meant to be a private function based on the name and since there is no docstring?).(just showing the first few)
Top level objects
The are only eight of these and we can probably keep all of them. Alternatively, since all except
Chart
have corresponding API functions and are rarely used directly we could move them to a charts sub domain:alt.charts.HConcatChart
, etc, to make it more clear that the preferred usage is the API methodalt.hconcat
, etc.Encoding channels
There is wasted space here by having each channel repeated three times, e.g.
Angle
,AngleDatum
,AngleValue
. Especially as the most common use of datum and value is by using the top level API functionsalt.datum
andalt.value
. I see to it possible ways of improving the situation:AngleDatum
we haveAngle.datum
.AngleDatum
, we would havealt.datum.Angle
.Maybe the second is lightly more natural in syntax, but it will probably make it confusing since you can use
datum.<colum_name>
in expression strings, so the first would be better for that reason.(just showing the first few)
Low level schema wrappers
This is where we could make the most progress. There are around 400 low level schemer wrappers of which I've only used around 10 myself, most of which are now unnecessary since we added the method based syntax for channel encoding options.
We should identify which of these are just internal helper classes for Altair that end users rarely need to know about. To me it seems like almost everything here could be moved to a submodule, except the classes that end in
Params
(although they could also be moved if we introduce aliases).I don't know what a good name of of this submodule might be: maybe
alt.extras
,alt.schema_wrappers
,alt.misc
,alt.wrappers
, or similar?(just showing the first few)
Again, none of these changes should break old code. For example, if we move
FillDatum
toalt.Fill.datum
, then there should no longer be any tab completion foralt.FillDatum
, but old code that use it should work as it just callsalt.Fill.datum
under the hood. The tab completion should be onalt.Fill.<tab>
instead, which also seems more natural to me.