owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License
58 stars 18 forks source link

typing: missing typing in some Table methods #2879

Open lucasrodes opened 6 days ago

lucasrodes commented 6 days ago

Our linter complains about the type returned by some Table methods, which are inherited from DataFrame.

I've been trying to fix this in https://github.com/owid/etl/pull/2854/. I'm not sure how to further adapt the code so that the CI/CD doesn't complain. Any ideas?

Selecting table subset

def example(tb: Table) → Table:
  return tb[["country"]]

Error:

Expression of type "DataFrame" is incompatible with return type "Table"
  "DataFrame" is incompatible with "Table"Pylance(https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportReturnType)

I figured this can be solved with __getitem__.

drop_duplicates

def example(tb: Table) -> Table:
    return tb.drop_duplicates()

Error:

Expression of type "DataFrame" is incompatible with return type "Table"
  "DataFrame" is incompatible with "Table"Pylance(https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportReturnType)

dropna

def example(tb: Table) -> Table:
    return tb.dropna()

Error:

Expression of type "Table | None" is incompatible with return type "Table"
  Type "Table | None" is incompatible with type "Table"
    "None" is incompatible with "Table"Pylance(https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportReturnType)
lucasrodes commented 6 days ago

FYI: @larsyencken @Marigold

Marigold commented 6 days ago

Pandas discourage use of [] in their docs:

The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommended that you take advantage of the optimized pandas data access methods exposed in this chapter.

So while it might be possible to hack [] operator to return the proper type, I'm in general terrified of subclassing pandas. With groupby we got lucky. I think that using .loc/.iloc is also safer when it comes to updating the dataframes (it doesn't modify the original object or something like that).

When I see tb[["..."]] and VSCode warning, I change it to .loc. That said, if you want to take the risk and try to add a support for it, go for it.

Marigold commented 6 days ago

As for other pandas methods that return pd.DataFrame, when I see them I add a method to catalog/tables.py like this

def drop_duplicates(self, *args, **kwargs) -> "Table":
        return super().drop_duplicates(*args, **kwargs)  # type: ignore

Perhaps there's a better solution, but I didn't find it...

larsyencken commented 3 days ago

Some of this may relate to annoying differences between VSCode and command-line linting:

Marigold commented 3 days ago

Related

https://github.com/owid/etl/issues/1113

larsyencken commented 3 days ago

@Marigold To look briefly, but ok to close for now if we can't make quick progress.