machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License
1.14k stars 48 forks source link

Add type annotations to siuba verbs and improve type compliance #447

Open sotte opened 1 year ago

sotte commented 1 year ago

First of all, I really like siuba and I use it a lot :pray:

siuba would be even better, if it a) had type annotations for the core verbs and b) type checkers like pyright/mypy/etc would not complain so much. (I tend to turn off type checking for my siuba code blocks.)

This ticket is related to https://github.com/machow/siuba/issues/418


Note: siuba uses decorators and singledispatch quite a bit. Sadly, getting type annotations right with decorators can be a bit painful in my experience :/

sotte commented 1 year ago

I just played with overload and singledispatch. This is mostly a note to myself, and maybe it's helpful for others.

The combination of singledispatch and overload is tricky and mypy and pyright react different.

Here is a minimal example (still quite verbose) with the right type annotations. The implementation is not as nice as singledispatch2/verb_dispatch though.

from functools import singledispatch
from typing import overload

from typing_extensions import reveal_type
import pandas as pd

@singledispatch
def _mutate(x: None | int | pd.DataFrame | pd.Series) -> None | int | pd.DataFrame:
    raise TypeError(type(x))

@_mutate.register
def _(x: None) -> None:
    return x

@_mutate.register
def _(x: int) -> int:
    return x + 1

@_mutate.register
def _(x: pd.DataFrame) -> pd.DataFrame:
    return x

@_mutate.register
def _(x: pd.Series) -> pd.DataFrame:
    return x.to_frame()

@overload
def mutate(x: None) -> None: ...
@overload
def mutate(x: int) -> int: ...
@overload
def mutate(x: pd.Series) -> pd.DataFrame: ...
@overload
def mutate(x: pd.DataFrame) -> pd.DataFrame: ...

def mutate(x: None | int | pd.DataFrame | pd.Series) -> None | int | pd.DataFrame:
    return _mutate(x)

if __name__ == "__main__":
    res = mutate(None)
    reveal_type(res)

    res = mutate(15)
    reveal_type(res)

    data = pd.DataFrame({})
    # assert isinstance(data, pd.DataFrame)
    res = mutate(data)
    reveal_type(res)

    data = pd.Series(data=[1.0], index=[0])
    res = mutate(data)
    reveal_type(res)

This works with pyright

pyright baz.py
...
pyright 1.1.272
/home/stefan/tmp/advanced_typing/baz.py
  /home/stefan/tmp/advanced_typing/baz.py:49:17 - information: Type of "res" is "None"
  /home/stefan/tmp/advanced_typing/baz.py:52:17 - information: Type of "res" is "int"
  /home/stefan/tmp/advanced_typing/baz.py:57:17 - information: Type of "res" is "DataFrame"
  /home/stefan/tmp/advanced_typing/baz.py:61:17 - information: Type of "res" is "DataFrame"
0 errors, 0 warnings, 4 informations
Completed in 1.32sec

but mypy complaints

mypy baz.py
baz.py:49: note: Revealed type is "None"
baz.py:52: note: Revealed type is "builtins.int"
baz.py:56: error: Incompatible types in assignment (expression has type "DataFrame", variable has type "Optional[int]")
baz.py:57: note: Revealed type is "builtins.int"
baz.py:59: error: Incompatible types in assignment (expression has type "Series[Any]", variable has type "DataFrame")
baz.py:60: error: Incompatible types in assignment (expression has type "DataFrame", variable has type "Optional[int]")
baz.py:61: note: Revealed type is "builtins.int"
Found 3 errors in 1 file (checked 1 source file)

I guess in a perfect world one would like a decorator @siuba_verb that combines the overload and the singledispatch.

machow commented 1 year ago

Thanks for the nudge and overload example! I'll add overloads to at least resolve #418.

There are some very relevant mypy issues open for better support singledispatch support:

I think the issue with overload is that it needs to be with the initial function declaration, but singledispatch.register can happen across modules / libraries. From the issues though, it seems like there many be a few reasonable solutions..