vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License
9.32k stars 794 forks source link

Reduce the number of entries available in the top level API #2918

Open joelostblom opened 1 year ago

joelostblom commented 1 year ago

Currently, the top level of the Altair API is huge, there are 573 entries in the tab completion menu of alt.<TAB>. This can make it rather hard to find what you're looking for unless you know the first few character of the name already, and exploring the API from scratch as a new user must be intimidating. Many of these seem unhelpful to have exposed at the top-level, e.g. there are 70 different entries that relate to conditions, but I only use alt.condition myself. I suggest that the top level API only includes the most relevant classes and functions that we think are helpful on a day to day basis, and the rest go under sub-modules.

The Altair API consists of four parts:

  1. API functions
  2. Top level objects
  3. Encoding channels
  4. Low level schema wrappers

I have listed my comments below on how we could reduce these, and I would love to hear what you think @mattijn, @ChristopherDavisUCI, @binste (and anyone else that wants to chime in)

Note: In all the changes I suggest below we should not break any old code, just remove the class/function from tab completion and link it to the new place in the API where this class/function now resides.

API functions

Introduced specifically by Altair as conveniences and we should keep all of them (although maybe check_fields_and_encodings is meant to be a private function based on the name and since there is no docstring?).

image (just showing the first few)

Top level objects

The are only eight of these and we can probably keep all of them. Alternatively, since all except Chart have corresponding API functions and are rarely used directly we could move them to a charts sub domain: alt.charts.HConcatChart, etc, to make it more clear that the preferred usage is the API method alt.hconcat, etc.

image

Encoding channels

There is wasted space here by having each channel repeated three times, e.g. Angle, AngleDatum, AngleValue. Especially as the most common use of datum and value is by using the top level API functions alt.datum and alt.value. I see to it possible ways of improving the situation:

  1. Move the top level classes to methods on the corresponding encoding class: instead of AngleDatum we have Angle.datum.
  2. Moved the top level classes to methods on the API function: instead of AngleDatum, we would have alt.datum.Angle.

Maybe the second is lightly more natural in syntax, but it will probably make it confusing since you can use datum.<colum_name> in expression strings, so the first would be better for that reason.

image (just showing the first few)

Low level schema wrappers

This is where we could make the most progress. There are around 400 low level schemer wrappers of which I've only used around 10 myself, most of which are now unnecessary since we added the method based syntax for channel encoding options.

We should identify which of these are just internal helper classes for Altair that end users rarely need to know about. To me it seems like almost everything here could be moved to a submodule, except the classes that end in Params (although they could also be moved if we introduce aliases).

I don't know what a good name of of this submodule might be: maybe alt.extras, alt.schema_wrappers, alt.misc, alt.wrappers, or similar?

image (just showing the first few)


Again, none of these changes should break old code. For example, if we move FillDatum to alt.Fill.datum, then there should no longer be any tab completion for alt.FillDatum, but old code that use it should work as it just calls alt.Fill.datum under the hood. The tab completion should be on alt.Fill.<tab> instead, which also seems more natural to me.

binste commented 1 year ago

I really like the idea of cleaning up the top-level API! Thanks for bringing this up. I also find it quite overwhelming to go through the API references in the docs and to use autocompletion if you don't specifically know what you are looking for (and therefore I never use completion on alt.Chart to browse options).

In favour of cleaning up the API references in the docs which we can maybe do without any code changes except to generate_api_docs.py. I can't quite follow your example of refactoring FillDatum without breaking old code, could you expand on this a bit? Would you rename that class or just move it to a new submodule and still import it in the top-level?

Exclusion from tab completion is a bit tricky. As we noticed in #2814, hiding in Jupyter works great but we haven't yet found a way to do the same in VS Code. Of course still worth the effort to adjust __all__ if at least Jupyter users can profit from it.

joelostblom commented 1 year ago

Would you rename that class or just move it to a new submodule and still import it in the top-level?

My hope was to move it to a submodule and import it at top level, but hide it from the top level tab completion and only allow tab completion via the submodule. I think this would allow for old code to work, while also encouraging people to use the submodules from now on as well as reduce the top-level tab completion.

Exclusion from tab completion is a bit tricky. As we noticed in https://github.com/altair-viz/altair/pull/2814, hiding in Jupyter works great but we haven't yet found a way to do the same in VS Code.

I think it would be great to support VS Code too, but if there isn't a way that pylance offers for hiding individual classes/functions, I am not sure there is much we can do until they add that. I tried asking in the discussion you linked before, let's see if there is a way.

joelostblom commented 1 year ago

Based on the helpful reply in https://github.com/microsoft/pylance-release/discussions/3709#discussioncomment-5093960 it seems like we need to get https://github.com/altair-viz/altair/pull/2614 or similar merged, but maybe it is worthwhile if it resolves some of the VS Code issues we have been seeing and makes it easier to maintain both VS Code and JupyterLab support?

binste commented 1 month ago

Some notes from an offline discussion with @mattijn and @arvind:

What is very unlikely to break users code but would improve the situation a lot is removing classes from the top-level namespace which are never used by users anyway as they just represent a primitive datatype such as int, float. Examples: ConditionalAxisPropertyTextBaselinenull, DictSelectionInit, ... There are hundreds of these.

A first step could be to analyze which class names only appear ones in our code base, i.e. are defined but never used anywhere else, not even in a type hint. This can be a manual list and with every VL release, we could check in the PR if new entries get added to the top-level, if we really want them there.

Maybe we can remove these classes completely instead of just not making them available in the top-level namespace. This would make the Altair wheels even smaller. It might require a bit of a rewrite of from_dict.

binste commented 1 month ago

Due to the many objects, Altair takes up 23MB of memory (if I measure this correctly?) and this excludes it's requirements which are small:

from memory_profiler import profile

@profile
def import_packages():
    import typing_extensions
    import jinja2
    import jsonschema
    import packaging
    import narwhals

    import altair

if __name__ == "__main__":
    import_packages()

image

dangotbanned commented 1 month ago

Some notes from an offline discussion with @mattijn and @arvind:

There is potential in Vega-Lite to simplify the types, i.e. reduce the number of types. This would automatically improve the situation here. This can happen in VL as a first step of a potentially bigger rewrite/refactoring. However, this first requires the appropriate funding and resources so it's not around the corner.

Exciting to hear this being discussed on the Vega-Lite-side!

@binste is there any chance this relates to better support for generic types in more recent drafts of JSON Schema?

A while back I tried exploring what we could do in altair to reduce the number of symbols, but it started getting tricky due to https://github.com/vega/vega-lite/blob/main/scripts/rename-schema.sh

Screenshots ![image](https://github.com/user-attachments/assets/da311919-567e-4609-b976-cc01422db30c) ![image](https://github.com/user-attachments/assets/59496a7f-39d5-4824-9b09-a22a8163d6d5)

If there is a v6 on the table, I do think it would be worth reviewing how we can better utilise typing features - that wouldn't have been available in earlier major versions of altair

binste commented 1 month ago

If it makes it easier for us to generate type hints in Python, we should definitely look at it once we get to that rewrite, thanks for flagging!

In general, I think their is some consensus around simplifying the type hierarchy in VL. At least for Altair, I don't think we need ConditionalParameter, LogicalAnd, etc. as no user will use this. Or, in case we need it for schema generation, we would burry them deep in the codebase as an implementation detail.

binste commented 1 month ago

When we look into this, let's also review SPEC 1 which has recommendations on how to do lazy-loading submodules and functions.

dangotbanned commented 1 month ago

When we look into this, let's also review SPEC 1 which has recommendations on how to do lazy-loading submodules and functions.

Will find the link later, but polars has a good solution for this which preserves typing

Edit

In terms of LOC, not much more than utils._importers and is doing a whole lot more:

dangotbanned commented 1 month ago

I haven't seen this mentioned anywhere.

We also have several modules available at multiple levels, which is making it complicated to resolve https://github.com/vega/altair/issues/3610

I discovered this yesterday and have written a summary of my findings here:

dangotbanned commented 4 weeks ago

@joelostblom @binste https://github.com/vega/altair/issues/2918#issuecomment-1441570706

I think I can see a path forward here.

Building on some of the ideas from https://github.com/vega/altair/issues/3366#issuecomment-2390928260 and #3618-theme.py

alt.value

alt.AngleValue -> alt.value.Angle This change wouldn't cause issues with existing code, as alt.value is currently a function. The actual implementation could look something like:

Definitions

```py from __future__ import annotations from typing import Any import altair as alt class _ValueMeta(type): """Metaclass for :class:`value`.""" @property def Angle(cls) -> type[alt.AngleValue]: return alt.AngleValue @property def Color(cls) -> type[alt.ColorValue]: return alt.ColorValue # ... for all the other `...Value` channels class value(metaclass=_ValueMeta): def __new__(cls, value: Any, **kwds: Any) -> dict[str, Any]: """Specify a value for use in an encoding.""" return dict(value=value, **kwds) ```

Usage

```py examples = ( value(1), value(1, offset=-5), value.Angle(0), value.Color("red"), value.Angle(0.5).condition(), value.Color("steelblue").condition(), ) >>> examples ({'value': 1}, {'value': 1, 'offset': -5}, AngleValue({ value: 0 }), ColorValue({ value: 'red' }), AngleValue({ condition: {}, value: 0.5 }), ColorValue({ condition: {}, value: 'steelblue' })) ```

Screenshot

![Image](https://github.com/user-attachments/assets/8b15d5df-b64c-4654-b177-34c5852957a3)

So this change alone is just a very lightweight rewrapping of the existing API. A v6 version might want to look at changing the wrapped class names and/or reprs.

alt.datum

As mentioned by @joelostblom

  1. Moved the top level classes to methods on the API function: instead of AngleDatum, we would have alt.datum.Angle.

Maybe the second is lightly more natural in syntax, but it will probably make it confusing since you can use datum.<colum_name> in expression strings, so the first would be better for that reason.

I originally mistook this for talking about a naive approach at supporting both.

For reference, that would look like:

Definitions

```py from __future__ import annotations from typing import Any import altair as alt from altair.expr.core import GetAttrExpression, GetItemExpression class _DatumMeta(type): """Metaclass for :class:`datum`.""" @property def Angle(cls) -> type[alt.AngleDatum]: return alt.AngleDatum @property def Color(cls) -> type[alt.ColorDatum]: return alt.ColorDatum def __repr__(self) -> str: return "datum" def __getattr__(self, attr) -> GetAttrExpression: if attr.startswith("__") and attr.endswith("__"): raise AttributeError(attr) return GetAttrExpression("datum", attr) def __getitem__(self, attr) -> GetItemExpression: return GetItemExpression("datum", attr) class datum(metaclass=_DatumMeta): def __new__(cls, datum: Any, **kwds: Any) -> dict[str, Any]: """Specify a datum for use in an encoding.""" return dict(datum=datum, **kwds) ```

Usage

```py examples_datum = ( datum, datum(1), datum(1, type="quantitative"), datum.Angle(1), datum.Color("green"), datum.Angle(0.2).bandPosition(0.8), datum.column_name, datum["column with spaces"], ) >>> examples_datum (datum, {'datum': 1}, {'datum': 1, 'type': 'quantitative'}, AngleDatum({ datum: 1 }), ColorDatum({ datum: 'green' }), AngleDatum({ bandPosition: 0.8, datum: 0.2 }), datum.column_name, datum['column with spaces']) ```

Screenshot

![Image](https://github.com/user-attachments/assets/93e99098-7b74-4fc7-ae08-2b3821d6c19a)

Issues

What if you have a column named "Color", or any of the other channel names?

Introspection

```py v6_color, v5_color = datum.Color, alt.datum.Color >>> print( f"v6 returns a class: {v6_color.__name__!r}\n " f"value: {v6_color!r}\n" f"v5 returns an instance of: {type(v5_color).__name__!r}\n " f"value: {v5_color!r}\n" ) v6 returns a class: 'ColorDatum' value: v5 returns an instance of: 'GetAttrExpression' value: datum.Color ```

Works for v5

```py import polars as pl df = pl.DataFrame( { "x": range(8), "y": [2, 5, 4, 7, 8, 3, 4, 5], "Color": [ "red", "blue", "green", "steelblue", "olive", "yellow", "red", "green", ], } ) base = alt.Chart(df).mark_bar().encode(x="x:N", y="y") v5_predicate = alt.datum.Color == "red" v5_chart = base.encode( color=alt.when(v5_predicate) .then(alt.value("black")) .otherwise("Color", legend=None) ) ```

Would break for v6

```py >>> base.encode( color=alt.when(datum.Color == "red") .then(alt.value("black")) .otherwise("Color", legend=None) ) TypeError: Expected a predicate, but got: 'bool' From `predicate=False`. ```

Would still work for v6

```py >>> base.encode( color=alt.when(datum["Color"] == "red") .then(alt.value("black")) .otherwise("Color", legend=None) ) ``` ![Image](https://github.com/user-attachments/assets/4317e1d9-0c0c-43cf-9007-8f5e117e54d6)

Possible solutions

  1. Do (almost) nothing.
  2. Deprecate alt.datum.__getattr__(...) for this purpose
    • Maybe controversial, but very possible given a MAJOR version is expected for this anyway.
    • Unless I'm missing something, it seems the same functionality is provided by alt.datum.__getitem__(...). Additionally:
      • Support for spaces and python keywords
      • (Soon) Support for alt.Parameter
    • Additional benefit of there being only one obvious way to reference a column via alt.datum.
  3. Return some kind of a proxy object from alt.datum.___
    • Acts like the wrapped channels class when called
    • Acts like GetAttrExpression in all other cases
    • Handling of this behavior would need to be added everywhere an Expression is currently expected
  4. Add another metaclass, to use for alt.___Datum classes
    • Somewhat like OperatorMixin, but these methods would be available on the class
    • I can see this working on a technical-level, but probably not intuitive to use

Of these options I strongly prefer [2]. Despite any potential disruption, it makes things less ambiguous and requires the least explanation to new users.

I think [1] would be short-sighted and a point of confusion for even trivial cases like: alt.datum.color vs alt.datum.Color.

Both [3] and [4] would introduce more complexity internally - even if they solved the issue of not breaking existing code.

Summary

I know there is a lot go through here, so no rush on a response 😅

Hope the examples are helpful. Even if we don't move forward with this idea; maybe they spark an idea for someone else.