clamydo opened this issue 2 years ago
That is definitely something that is in scope. But I don't really see how we would provide the types. Do we need the DataFrame object to accept types?
Something that you might be interested in as well: https://github.com/kolonialno/patito
In Spark, there is a notion of a Dataset, which is a strongly typed version of a DataFrame. Datasets are built using Scala case classes, so that should be translatable to Python or TypeScript classes easily enough. Some pseudo-code:
```python
@pl_dataset  # a Python decorator to wrap a class into a 'dataset' class
class DName:
    id: pl.Int64
    name: pl.Utf8

# convert df to ds
dNameDS = df.to_ds(DName)
```
But those are a runtime type check, I assume? Or do they only exist in Spark (not PySpark)? If we had that at compile time, we would get a combinatorial explosion.
AFAIK this is not yet possible in Python and is still a long way from becoming a possibility. It sounds like you are suggesting making DataFrame generic over the types of its columns. This immediately presents a few problems.

1. Prior to PEP 646, the only way to parameterize a class over a varying number of types was to define a separate class for each arity,

```python
from typing import Generic, TypeVar

Axis1 = TypeVar('Axis1')
Axis2 = TypeVar('Axis2')

class Array1(Generic[Axis1]): ...

class Array2(Generic[Axis1, Axis2]): ...
```

a pattern PEP 646 describes as being cumbersome and leading to a proliferation of classes. For a DataFrame with a practically infinite number of allowed columns, it obviously would never work.
PEP 646 addresses this by introducing `TypeVarTuple` (and the `Unpack` type). This would allow us to do something like

```python
from typing import Generic, TypeVarTuple

Dtypes = TypeVarTuple('Dtypes')

class DataFrame(Generic[*Dtypes]):
    ...

df: DataFrame[int, str, float] = pl.DataFrame()
```
2. But now suppose our DataFrame has a method `def column(self, i: int) -> ???`. What should that return type be? PEP 646 does not provide a way to annotate a particular index of our variadic generic. The closest mention I see is the section "Overloads for Accessing Individual Types", but that approach would still not be helpful here.
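For illustration, a sketch of the pattern that section describes, adapted to the hypothetical `DataFrame` above (Python 3.11+). Each combination of column count and index needs its own overload, so it cannot scale to an arbitrary number of columns:

```python
from typing import Generic, Literal, TypeVar, TypeVarTuple, overload

Dtypes = TypeVarTuple("Dtypes")
C1 = TypeVar("C1")
C2 = TypeVar("C2")

class DataFrame(Generic[*Dtypes]):
    @overload
    def column(self: "DataFrame[C1]", i: Literal[0]) -> C1: ...
    @overload
    def column(self: "DataFrame[C1, C2]", i: Literal[0]) -> C1: ...
    @overload
    def column(self: "DataFrame[C1, C2]", i: Literal[1]) -> C2: ...
    def column(self, i: int):
        # Runtime implementation omitted; the point is that a two-column
        # frame already needs three overloads, and every additional column
        # multiplies the number of overloads required.
        ...
```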
3. A common way to construct a DataFrame is via a dictionary, i.e.

```python
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
```

How is mypy supposed to infer the generic arguments? We would want it to infer something like `DataFrame[int, int]`, but our DataFrame object is defined to accept a `dict[str, Sequence[Any]]` in its `__init__` method. All the type information is immediately erased. If we used `dict[str, Sequence[float | int | str | ...]]` we would still have a problem, because there is no way to communicate the order of those sequences to mypy. All it knows is that it is a dictionary with string keys and values that may be one of many things.
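A minimal sketch of that erasure problem, reusing the hypothetical variadic `DataFrame` from above (Python 3.11+):

```python
from typing import Any, Generic, Sequence, TypeVarTuple

Dtypes = TypeVarTuple("Dtypes")

class DataFrame(Generic[*Dtypes]):
    # The constructor only ever sees dict[str, Sequence[Any]], so nothing in
    # this signature lets a type checker bind *Dtypes to the column types.
    def __init__(self, data: dict[str, Sequence[Any]]) -> None:
        self._data = data

# We would like the checker to see DataFrame[int, int] here, but the
# per-column type information is erased as soon as the dict is passed in.
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
```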
Although I completely agree it would be an extremely valuable feature, annotating columns is just not yet supported by Python. Would love to hear if someone has thought of a way to accomplish this, though.
One potential option is to follow SQLModel's approach, but for a columnar format: the user declares something like

```python
class MyData(???, pl.DataFrame):
    name: Utf8Series
    age: IntSeries
    ...
```

and the metaclass machinery behind ??? takes care of performing the necessary validation of the column types at runtime. Users access MyData by column names, which carry type information. Every operation that changes the schema requires a new class (if the user wants type information).
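For concreteness, a rough sketch of the runtime-validation part. The names `TypedFrameMeta` and `TypedFrame` are hypothetical, not an existing Polars or SQLModel API; only the Polars calls are real:

```python
import polars as pl

class TypedFrameMeta(type):
    """Hypothetical metaclass: collect the class-level annotations into a schema."""
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        cls.schema = dict(namespace.get("__annotations__", {}))
        return cls

class TypedFrame(metaclass=TypedFrameMeta):
    """Hypothetical wrapper that validates a DataFrame against the declared schema."""
    def __init__(self, df: pl.DataFrame) -> None:
        for col, dtype in self.schema.items():
            found = df.schema.get(col)
            if found != dtype:
                raise TypeError(f"column {col!r}: expected {dtype}, found {found}")
        self.df = df

class MyData(TypedFrame):
    name: pl.Utf8
    age: pl.Int64

MyData(pl.DataFrame({"name": ["a", "b"], "age": [1, 2]}))  # passes validation
```

A fuller version would also generate typed per-column accessors, which is what would make "access by column names, which carry type information" visible to the type checker.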
@matteosantama, these are valid points. Indeed, using variadic generics could be one way of providing the types. However, as @jorgecarleitao proposes, there are other more static ways to express the column types.
Regarding your 2nd issue: accessing a column by a dynamically (at runtime) provided index is indeed difficult to type statically 😉. There is no way to type that column function other than with a generic Series type, but that's fine. I think there is a misconception about what level of guarantees a static type checker can and should provide[^1].
A value not known at check time obviously cannot be checked there. But if that is your use case, your types should reflect that openness and restrict what they can (for example, that the return type will be a Series and not a str).
To check something like that statically would require a static reference type that is also provided statically; an enum type comes to mind, for example. This would allow one to specify that a function expects a specific column of a known data frame.
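A minimal sketch of that enum idea (the names are hypothetical; only the standard Polars `get_column` call is real):

```python
from enum import Enum
import polars as pl

class UserCol(Enum):
    """Statically known column names of one particular kind of frame."""
    NAME = "name"
    AGE = "age"

def get_user_column(df: pl.DataFrame, col: UserCol) -> pl.Series:
    # The concrete value is only known at runtime, so the return type stays a
    # generic Series; but passing a misspelled or unknown column name is now
    # a static type error instead of a runtime failure.
    return df.get_column(col.value)
```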
For your issue raised in 3., a type checker actually could infer the type from the type of the provided literals.
What I am after is gradually opening the black box DataFrame to a type checker, so that one can, for example, specify that a function expects a data frame with columns of a given name/specifier and a particular type, instead of just some arbitrary DataFrame.
Does this make sense?
[^1]: https://lexi-lambda.github.io/blog/2020/01/19/no-dynamic-type-systems-are-not-inherently-more-open/ is a good read
Ok, fair point. Partial typing can be useful even if total typing can't be achieved. But I still see major obstacles.
> For your issue raised in 3., a type checker actually could infer the type from the type of the provided literals.
Yes, I agree that the type checker would be able to infer the types from the literals, but it would not be able to determine the order of those types within the DataFrame.
Which functions do you think could benefit from unordered generics? From where I stand, it's basically just functions that return the same type, for example (see the sketch below):

- `.head`
- `.tail`
- `.slice`
- `.take_every`
- `.shrink_to_fit`
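For what it's worth, a sketch of how those schema-preserving methods could be annotated under the variadic approach (hypothetical stubs, Python 3.11+):

```python
from typing import Generic, TypeVarTuple

Dtypes = TypeVarTuple("Dtypes")

class DataFrame(Generic[*Dtypes]):
    # Schema-preserving methods can simply return the same parametrization;
    # no knowledge of individual column positions or order is required.
    def head(self, n: int = 5) -> "DataFrame[*Dtypes]": ...
    def tail(self, n: int = 5) -> "DataFrame[*Dtypes]": ...
    def slice(self, offset: int, length: int | None = None) -> "DataFrame[*Dtypes]": ...
```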
I think you are right that variadic/unordered generics are not a good approach. Perhaps an ordered type declaration, as proposed by @jorgecarleitao, is the more promising approach?
Is there any progress on this case?
I'm building full-stack scientific applications where a precise data model is very helpful, and I would like to try combining Polars alongside my Pydantic models.
But then I need a way of typing them. As a Pydantic user, @jorgecarleitao's solution looks to be exactly what I would need to start using Polars.
> Is there any progress on this case?
> I'm building full-stack scientific applications where a precise data model is very helpful, and I would like to try combining Polars alongside my Pydantic models.
> But then I need a way of typing them. As a Pydantic user, @jorgecarleitao's solution looks to be exactly what I would need to start using Polars.
Have you tried https://github.com/kolonialno/patito?
After testing it a bit, it seems like patito will not give type completion or dot-notation for accessing columns, which is, for me, half the problem when using dataframes in large-scale applications.
Is there a fundamental reason why Polars does not support dot-notation?
> Is there a fundamental reason why Polars does not support dot-notation?
FYI: if you're in a Jupyter/IPython notebook, we do now support column autocomplete where possible/feasible: https://github.com/pola-rs/polars/pull/5477.
> Is there a fundamental reason why Polars does not support dot-notation?
Yes! Ambiguity. Dot notation implies attributes, and attributes should be static; with dot notation it could be either an attribute or a column. For columns it's also a poor fit, as it doesn't even allow you to access all possible names. Besides, you shouldn't be accessing columns via `df` much anyway, as that is an anti-pattern when dealing with the expression API. And lastly, we can already access columns, so why add yet another way to do so?
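For reference, the existing access patterns (all standard Polars API):

```python
import polars as pl

df = pl.DataFrame({"name": ["a", "b"], "age": [1, 2]})

s1 = df["age"]                      # square-bracket indexing returns a Series
s2 = df.get_column("age")           # explicit method call
out = df.select(pl.col("age") + 1)  # expression API, the recommended style
```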
I'd like to ask whether type hints for column types were or are being considered? Something along the lines of https://github.com/CedricFR/dataenforce?
With dataenforce, the annotation is a Dataset parameter listing the expected columns; the README shows variants with and without dtypes.
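A sketch of that style from memory of the dataenforce `Dataset` API (hypothetical column names, not copied verbatim from the README):

```python
from dataenforce import Dataset

# A function that requires a DataFrame containing these columns
def process_people(data: Dataset["id", "name"]):
    ...

# The same idea, with dtypes attached to the column names
def process_ages(data: Dataset["id": int, "age": int]):
    ...
```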
This would allow statically checking properties like the names and types of columns, making it safer to work with DataFrames. Without such type hints they are basically a black box for mypy et al. In my experience, this makes DataFrames in general (also pandas) hard to use in production-level code; I've spent days chasing weird bugs associated with pandas (where, for example, a column name changed or was mistyped).
Is there interest in something like that?