pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Python column type hints #3623

Open clamydo opened 2 years ago

clamydo commented 2 years ago

I'd like to ask: have type hints for column types been considered, or are they being considered? Something along the lines of https://github.com/CedricFR/dataenforce?

With dataenforce it looks like this (from the README):

def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
  pass

or

DName = Dataset["id", "name"]
DLocation = Dataset["id", "latitude", "longitude"]

# Expects columns id, name
def process1(data: DName):
  pass

# Expects columns id, name, latitude, longitude, timestamp
def process2(data: Dataset[DName, DLocation, "timestamp"]):
  pass

This would allow statically checking properties like the names and types of columns, making it safer to work with DataFrames. Without such type hints they are basically a black box for mypy et al. In my experience, this makes DataFrames in general (also pandas) hard to use in production-level code; I've spent days chasing weird bugs where, for example, a column name changed or was mistyped.

Is there interest in something like that?

ritchie46 commented 2 years ago

That is definitely something that is in scope. But I don't really see how we should provide the types. Do we need the DataFrame object to accept types?

Something that you might be interested in as well: https://github.com/kolonialno/patito
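Roughly, a patito model looks like this (a minimal sketch based on patito's README; the exact API may have changed since):

import patito as pt
import polars as pl

# a model declares column names and types; validate() checks a DataFrame
# against that schema at runtime
class User(pt.Model):
    id: int
    name: str

df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})
User.validate(df)  # raises a validation error if the schema doesn't match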

universalmind303 commented 2 years ago

In Spark there is the notion of a Dataset, which is a strongly typed version of a DataFrame.

They are built using Scala case classes, so that should be translatable to Python or TypeScript classes easily enough.

Some pseudo-code:

@pl_dataset  # a Python decorator to wrap a class into a 'dataset' class
class DName:
    id: pl.Int64
    name: pl.Utf8

# convert df to ds
dname_ds = df.to_ds(DName)
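For context, a rough sketch of how such a wrapper could be built in Python today - pl_dataset and to_ds are hypothetical names taken from the pseudo-code above, not part of the Polars API:

import polars as pl

def pl_dataset(cls):
    # collect the declared schema from the class annotations
    cls.schema = dict(cls.__annotations__)
    return cls

@pl_dataset
class DName:
    id: pl.Int64
    name: pl.Utf8

def to_ds(df: pl.DataFrame, ds) -> pl.DataFrame:
    # keep only the declared columns and cast them to the declared dtypes;
    # this raises at runtime if a declared column is missing
    return df.select([pl.col(name).cast(dtype) for name, dtype in ds.schema.items()])

dname_ds = to_ds(pl.DataFrame({"id": [1, 2], "name": ["a", "b"]}), DName)

The static-typing part (making mypy aware of the columns) is the open question; this sketch only covers the runtime side.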

ritchie46 commented 2 years ago

But they are a runtime type check, I assume? Or do they only exist in Spark (not PySpark)? If we had that at compile time, we would have a combinatorial explosion.

matteosantama commented 2 years ago

AFAIK this is not yet possible in Python and is still a long way from becoming a possibility. It sounds like you are suggesting making DataFrame generic over the types of its columns. This immediately presents a few problems.

1. Python does not yet support a variable number of generic parameters. There is an accepted PEP regarding variadic generics, but it's not expected to land until Python 3.11. In that PEP, they mention the current way of supporting variadic generics
Axis1 = TypeVar('Axis1')
Axis2 = TypeVar('Axis2')

class Array1(Generic[Axis1]): ...

class Array2(Generic[Axis1, Axis2]): ...

as being cumbersome and leading to a proliferation of classes. For a DataFrame with a practically infinite number of allowed columns, it obviously would never work.

2. Suppose we wait for variadic generics to land (or use the back-ported Unpack type). This would allow us to do something like
from typing import Generic, TypeVarTuple

Dtypes = TypeVarTuple('Dtypes')

class DataFrame(Generic[*Dtypes]): 
    ...

df: DataFrame[int, str, float] = pl.DataFrame()

but now suppose our DataFrame has a method

def column(self, i: int) -> ???

What should that return type be? PEP 646 does not provide a way to annotate a particular index of our variadic generic. The closest mention I see is the section Overloads for Accessing Individual Types, but that approach would still not be helpful.

3. Finally, recall that a popular way of constructing a DataFrame is via a dictionary, i.e.
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

How is mypy supposed to infer the generic arguments? We would want it to infer something like DataFrame[int, int], but our DataFrame object is defined to accept a dict[str, Sequence[Any]] in its __init__ method, so all the type information is immediately erased. If we used dict[str, Sequence[float | int | str | ...]] we would still have a problem, because there is no way to communicate the order of those sequences to mypy. All it knows is that it has a dictionary with string keys and values that may be one of many things (see the sketch below).
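To make the third point concrete, here is a stripped-down stand-in for the constructor described above - not the real Polars __init__, just an illustration of where the type information is lost:

from typing import Any, Sequence

class DataFrame:
    # simplified constructor: every column is just a Sequence[Any],
    # so per-column dtypes are erased at the signature boundary
    def __init__(self, data: dict[str, Sequence[Any]]) -> None:
        self._data = data

df = DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# mypy only sees `DataFrame` here; it cannot recover that column "a" holds ints,
# that column "b" holds strings, or what order the columns come in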

Although I completely agree it would be an extremely valuable feature, annotating columns is just not yet supported by Python. Would love to hear if someone has thought of a way to accomplish this, though.

jorgecarleitao commented 2 years ago

One potential option is to follow SQLModel's approach, but for a columnar format - the user declares something like

class MyData(???, pl.DataFrame):
    name: Utf8Series
    age: IntSeries
    ...

and the meta ??? takes care of performing the necessary validation of the column types at runtime. Users access MyData by column names, which carry type information. Every operation that changes the schema requires a new class (if the user wants type information).
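A rough sketch of how the ??? part could behave at runtime, using a plain base class instead of a metaclass - TypedFrame and validate are hypothetical names, not a Polars API:

import polars as pl

class TypedFrame:
    @classmethod
    def validate(cls, df: pl.DataFrame) -> pl.DataFrame:
        # compare the subclass annotations against the frame's runtime schema
        for name, dtype in cls.__annotations__.items():
            if name not in df.columns:
                raise ValueError(f"missing column {name!r}")
            if df.schema[name] != dtype:
                raise ValueError(f"column {name!r} is {df.schema[name]}, expected {dtype}")
        return df

class MyData(TypedFrame):
    name: pl.Utf8
    age: pl.Int64

MyData.validate(pl.DataFrame({"name": ["alice"], "age": [30]}))  # passes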

clamydo commented 2 years ago

@matteosantama, these are valid points. Indeed, using variadic generics could be one way of providing the types. However, as @jorgecarleitao proposes, there are other more static ways to express the column types.

Regarding your 2nd issue: accessing a column by an index that is only provided dynamically (at runtime) is indeed difficult to type statically 😉 - there is no way to type that column function other than with a generic Series return type - but that's fine. I think there is a misconception about what level of guarantees a static type checker can and should provide[^1].

A value not known at check time obviously cannot be checked there. But if that is your use case, your types should reflect that openness and restrict what they can (for example, that the return type will be a Series and not a str).

Checking something like that statically would require a column reference that is itself provided statically - an enum type, for example, comes to mind. This would allow specifying that a function expects a specific column of a known data frame.
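For illustration, a minimal sketch of the enum idea - UserCols and its members are made up for this example:

from enum import Enum

import polars as pl

class UserCols(str, Enum):
    # statically known column names of one particular data frame
    ID = "id"
    NAME = "name"

def get_user_column(df: pl.DataFrame, col: UserCols) -> pl.Series:
    # a type checker rejects get_user_column(df, "naem"); only UserCols members are accepted
    return df.get_column(col.value)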

For your issue raised in 3., a type checker actually could infer the type from the type of the provided literals.

What I am after is gradually opening the black-box DataFrame to a type checker - to, for example, specify that a function expects a data frame with columns of a particular name/specifier and type, instead of just some arbitrary DataFrame.

Does this make sense?

[^1]: https://lexi-lambda.github.io/blog/2020/01/19/no-dynamic-type-systems-are-not-inherently-more-open/ is a good read

matteosantama commented 2 years ago

Ok, fair point. Partial typing can be useful even if total typing can't be achieved. But I still see major obstacles.

> For your issue raised in 3., a type checker actually could infer the type from the type of the provided literals.

Yes, I agree that the type checker would be able to infer the types from the literals, but it would not be able to determine the order of those types within the DataFrame.

Which functions do you think could benefit from unordered generics? From where I stand, it's basically just functions that return the same type, for example

clamydo commented 2 years ago

I think you are right that variadic/unordered generics are not a good approach. Perhaps an ordered type declaration, as proposed by @jorgecarleitao, is a more promising approach?

wholmen commented 1 year ago

Is there any progress on this case?

I'm building full-stack scientific applications where a precise data model is very helpful, and I would like to try combining Polars alongside my Pydantic models.

But then I need to have a way of typing. As a Pydantic user, @jorgecarleitao's solution looks to be exactly what I would need to start using Polars.

ritchie46 commented 1 year ago

> Is there any progress on this case?
>
> I'm building full-stack scientific applications where a precise data model is very helpful, and I would like to try combining Polars alongside my Pydantic models.
>
> But then I need to have a way of typing. As a Pydantic user, @jorgecarleitao's solution looks to be exactly what I would need to start using Polars.

Have you tried https://github.com/kolonialno/patito?

wholmen commented 1 year ago

After testing a bit, it seems like patito will not give type completion or dot notation for accessing columns, which is, for me, half the problem when using DataFrames in large-scale applications.

Is there a fundamental reason why Polars does not support dot notation?

alexander-beedie commented 1 year ago

> Is there a fundamental reason why Polars does not support dot notation?

FYI: if you're in a Jupyter/IPython notebook, we do now support column autocomplete where possible/feasible: https://github.com/pola-rs/polars/pull/5477.

ritchie46 commented 1 year ago

> Is there a fundamental reason why Polars does not support dot notation?

Yes! Ambiguity. Dot notation implies attributes, and attributes should be static; with dot notation an access could be either an attribute or a column. For columns it's also a poor fit, as it doesn't even allow you to access all possible names (column names aren't always valid Python identifiers). And lastly, you shouldn't be accessing columns via df much anyway, as that is an anti-pattern when working with the expression API.

And we can already access columns, so why add yet another way to do so?
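For reference, the existing access patterns versus the expression API, using the current Python API:

import polars as pl

df = pl.DataFrame({"name": ["a", "b"], "age": [1, 2]})

# existing ways to pull a column out of a DataFrame
s1 = df["age"]
s2 = df.get_column("age")

# the idiomatic route: keep the work inside the expression API instead of
# extracting columns from the frame
out = df.filter(pl.col("age") > 1).select(pl.col("name"))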