vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License
9.4k stars 795 forks source link

`_EncodingMixin.encode` types defined too narrow #3552

Closed dangotbanned closed 2 months ago

dangotbanned commented 3 months ago

What happened?

Noticed this regression since https://github.com/vega/altair/pull/3515 while writing the test for https://github.com/vega/altair/pull/3394#issuecomment-2302995453

image

Minimal repro

import altair as alt

alt.Chart().encode(color=alt.when(x=1).then(alt.value("grey")))

What would you like to happen instead?

The types @binste and I defined for EncodeKwds were too strict and didn't account for alt.(condition|when).

The simplest fix would be to add SchemaBase to each annotation - as that is what Then falls into

Which version of Altair are you using?

https://github.com/vega/altair/commit/c984002d500f42ea14ba38e087f0c746af5e3722

binste commented 3 months ago

Nice catch!

Looking a bit into it, I think it's not a regression as color (and other channels) were not typed with SchemaBase before as well. It worked/still works for alt.condition if it returns a dict, which it does in some cases, and not SchemaBase.

Adding SchemaBase as a hint sounds pragmatic to me. Feel free to go for it! I'm mostly without reception the next 2 days but can check in again on Sunday. If you want to merge this and release before that, that's ok with me. We can narrow the type hint again in the future if we have a better idea.

dangotbanned commented 3 months ago

Nice catch!

Looking a bit into it, I think it's not a regression as color (and other channels) were not typed with SchemaBase before as well. It worked/still works for alt.condition if it returns a dict, which it does in some cases, and not SchemaBase.

Huh you're right @binste. I spent so much time with the @overload(s) on alt.(condition|when) that I was sure SchemaBase was here before.

Adding SchemaBase as a hint sounds pragmatic to me. Feel free to go for it! I'm mostly without reception the next 2 days but can check in again on Sunday. If you want to merge this and release before that, that's ok with me. We can narrow the type hint again in the future if we have a better idea.

There are a few things we can try (will update tomorrow with some alternatives) but I think we risk making .encode() too noisy in terms of annotations again

Update

Below represent color, renamed just to keep the same width for name here

# New alias def
Schema_3: TypeAlias = Map | SchemaBase

# Alternative idea I had for `Then`
# Instead of inheriting from `SchemaBase`
@runtime_checkable
class SchemaLike(Protocol):
    _schema: ClassVar[dict[Literal["type"], Literal["object"]]] = {"type": "object"}

    def to_dict(self, *args, **kwds) -> Any: ...

# How we'd use `SchemaLike` in annotations
Schema_4: TypeAlias = SchemaBase | SchemaLike

# Combining 4 & 5 essentially
Schema_5: TypeAlias = Map | SchemaBase | SchemaLike

class _EncodingMixin
    def encode(
        self, 
        col_0: Optional[str | Color | Map | ColorDatum | ColorValue] = Undefined,
        col_1: Optional[str | Color | Map | SchemaBase | ColorDatum | ColorValue] = Undefined,
        col_2: Optional[str | Color | Map | Then[Any] | ColorDatum | ColorValue] = Undefined,
        col_3: Optional[str | Color | Schema_3 | ColorDatum | ColorValue] = Undefined,
        col_4: Optional[str | Color | Map | Schema_4 | ColorDatum | ColorValue] = Undefined,
        col_5: Optional[str | Color | Schema_5 | ColorDatum | ColorValue] = Undefined,
    ): ...

Specifically on the SchemaLike idea, I'm referring to a stash I have left over. Would be quite easy to integrate, and would also be useful in simplifying this check in _todict() 3275c5e (#3501):

Screenshot

![image](https://github.com/user-attachments/assets/fd40535b-4b16-4a6f-95d0-4e58b0550e2d)

binste commented 2 months ago

For the above reasons, I currently prefer col_1. It solves the type errors, although the hints will be too broad. But I don't see a simple way of fixing that without changing the return types of alt.condition which sounds like a larger issue. Not sure if that's worth tackling right now. What do you think?

dangotbanned commented 2 months ago

Really appreciate you taking the time to think this over @binste

Static/type checking

col_4 and col_5: If I understand it correctly, all SchemaBase would pass as SchemaLike. Hence, if we use SchemaLike, we are as permissive as if we would just use SchemaBase directly, even a bit more. As SchemaBase is already too broad of a type hint, I'm not sure if we get any mileage out of the protocol.

I've revised the original idea a little and added some examples over on #3567 to demo usage.

So far, that covers what mypy|pyright think - but the other component is the runtime checking.


Runtime

With @runtime_checkable, you are right that any SchemaBase will still pass. However, we do already work around this for other protocols e.g. for SupportsGeoInterface

alt.utils.data.to_values

https://github.com/vega/altair/blob/5207768b6e533c0509218376942309d1c7bac22f/altair/utils/data.py#L311-L332

It is mostly just thinking about which might need to take priority for a particular use.

Reusing the _to_dict example from before, SchemaBase already takes priority. SchemaLike would be replacing the hasattr(obj, "to_dict") branch - or both could be merged - since all we need is the method found in the protocol.

tools.schemapi.schemapi._todict

https://github.com/vega/altair/blob/5207768b6e533c0509218376942309d1c7bac22f/tools/schemapi/schemapi.py#L501-L536

Additional POC

An earlier version of Then worked like this using a _ThenLike protocol

Edit

I'm realising now that I forgot to include another motivating factor for this change.

In this comment I gave some examples of how inheriting from SchemaBase made the autocompletion for methods less helpful.

If we instead inherited from SchemaLike, most of the additional methods would not appear. I think they aren't particularly helpful to have - when we only need to_dict().

As a user, the only ones that are important are Then.when and Then.otherwise.

binste commented 2 months ago

The draft PR is very helpful! Trying to summarise/reword it for myself:

How about

@runtime_checkable
class Condition(Protocol):
   """We use this protocol to signal to a user that they can pass any condition created by either `alt.condition` or `alt.when`.

   Other SchemaBase instances will pass as `Condition` as well and so for type checkers it won't matter but using this protocol when only conditions are expected at least gives a better visual indication to users.
   """
    _schema: ClassVar[_SchemaLikeDict] = {"type": "object"}

    def to_dict(self, *args, **kwds) -> Any: ...

def chart_encode(
   col_7: Optional[str | Color | Condition | Map | ColorDatum | ColorValue] = Undefined,
)

Condition is not yet used anywhere in the codebase and it makes it clearer what we intend to do with the protocol. What do you think?

dangotbanned commented 2 months ago

Feedback

The draft PR is very helpful! Trying to summarise/reword it for myself: ...

No problem @binste, yeah you've got the assignment summarized well 😉

How about

Code block > ```python > @runtime_checkable > class Condition(Protocol): > """We use this protocol to signal to a user that they can pass any condition created by either `alt.condition` or `alt.when`. > > Other SchemaBase instances will pass as `Condition` as well and so for type checkers it won't matter but using this protocol when only conditions are expected at least gives a better visual indication to users. > """ > _schema: ClassVar[_SchemaLikeDict] = {"type": "object"} > > def to_dict(self, *args, **kwds) -> Any: ... > > > def chart_encode( > col_7: Optional[str | Color | Condition | Map | ColorDatum | ColorValue] = Undefined, > ) > ``` >

Now this is interesting ... Big +1 for the visual clarity - I think that should be a high priority.

Condition is not yet used anywhere in the codebase and it makes it clearer what we intend to do with the protocol. What do you think?

So it looks like we approached this from different angles. Where I went for SchemaLike to represent SchemaBase, your proposal is more narrowly focused on conditions themselves.

My immediate thought was combining the two, where internally all we care about is .to_dict() but externally we are more specific:

Condition(SchemaLike) ```py from typing import Any, ClassVar, Dict, Literal from typing_extensions import Protocol, TypeAlias, runtime_checkable, LiteralString _SchemaLikeDict: TypeAlias = Dict[Literal["type"], LiteralString] @runtime_checkable class SchemaLike(Protocol): _schema: ClassVar[_SchemaLikeDict] def to_dict(self, *args, **kwds) -> Any: ... @runtime_checkable class Condition(SchemaLike): _schema: ClassVar[dict[Literal["type"], Literal["object"]]] = {"type": "object"} ```

We could also reuse SchemaLike elsewhere in OperatorMixin|Expression|DatumType|Parameter

Technical

Have you had a chance to look at the implementation of when-then-otherwise yet?

api.py https://github.com/vega/altair/blob/a7c227b08b29b393ba2b0eeeda69959f2eb28c3b/altair/vegalite/v5/api.py#L629-L1252
Implementation notes

The particular issue is that - `.when()` is **always** an intermediate step - `.then()` returns an object that *can* represent a condition **or** an intermediate step - `.otherwise()` is **always** a final step To pull this off (*currently*) - `When|ChainedWhen` are not recognised by any of the API - `Then` is converted to the wrapped `dict` via `.to_dict()` if used in an encoding - `.otherwise()` returns the final wrapped `dict`

Apologies if you already understood the above, but if not, does it cause you to reconsider?

I really like the idea of annotating as Condition - but I think that would need to be the name of a TypeAlias - to reflect the result of .otherwise():

@runtime_checkable
class _ConditionLike(SchemaLike):
    _schema: ClassVar[dict[Literal["type"], Literal["object"]]] = {"type": "object"}

Condition: TypeAlias = _ConditionLike | api._Conditional[Any]

This definition could then also reflect the return of alt.condition() - since that can return a dict. I know that they would already be covered by Map - but I think a name as broad as Condition should capture all the acceptable states.

Edit

Alternative names: Conditional, IntoCondition


Apologies for the essay @binste! 😅 But your idea gave me a lot to think about and I'm glad we didn't rush this into v5.4.1

binste commented 2 months ago

I thought it was ok to just use Map to cover otherwise and alt.condition (in the usual cases) but agree that it's even better to also have it in Condition -> You're TypeAlias idea sounds great!

dangotbanned commented 2 months ago

I thought it was ok to just use Map to cover otherwise and alt.condition (in the usual cases) but agree that it's even better to also have it in Condition -> You're TypeAlias idea sounds great!

Thanks happy to hear it @binste

I'll loop back to this but definitely want to have this fixed for v5.5.0. For reference on the name of this alias, these were why I'd be leaning away from Condition:

Using Into... would align with IntoExpression: https://github.com/vega/altair/blob/df14929075b45233126f4cfe579c139e0b7f0559/altair/expr/core.py#L240

Maybe this annotation represents:

"anything that can convert into a conditional encoding/property"


None of these directly collide with Condition, but there is somewhat of a hierarchy to components of conditions defined: https://github.com/vega/altair/blob/df14929075b45233126f4cfe579c139e0b7f0559/altair/vegalite/v5/api.py#L629-L660 https://github.com/vega/altair/blob/df14929075b45233126f4cfe579c139e0b7f0559/altair/vegalite/v5/api.py#L523-L537

binste commented 2 months ago

I don't have a strong preference here. IntoCondition is also fine. It could also be a type alias which we do not expose externally and then we should just aim for what provides the best UX and can always change it later if we want to broaden it's scope.

dangotbanned commented 2 months ago

Will be moving onto this issue next, just noting here another internal typing issue for when-then-otherwise that I'll attempt to fix in that PR:

Screenshot ![image](https://github.com/user-attachments/assets/d9473d18-a5bf-474f-838c-c5a9da9db74c)

I'm only getting this warning after correctly enabling support for __extra_items__ in https://github.com/vega/altair/pull/3585.

I used a different benefit of the feature in fix: Support hyphenated keys in TypedDict(s) (0bdfa72 (#3536))