unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.05k stars 281 forks source link

Add a polars `Series` type #1594

Closed baldwinj30 closed 3 weeks ago

baldwinj30 commented 4 weeks ago

Is your feature request related to a problem? Please describe. This would provide a similar pandera.typing.Series type for the polars API that exists for pandas and other backends. It is not strictly necessary since it does not get checked, but it would be good to have to match the existing API and to help type checkers understand accessing string column names from the class attribute names.

Describe the solution you'd like Allow for annotating polars data frame fields as:

import pandera.polars as pa
from pandera.typing.polars import Series

class MySchema(pa.DataFrameModel):
    a: Series[int]
baldwinj30 commented 4 weeks ago

Working on this now

cosmicBboy commented 4 weeks ago

@baldwinj30 can we discuss why this is useful?

I actually see the Series type used in a DataFrameModel class definition as too verbose with not much added value, and am considering deprecating it in the pandas validation backend at some point (probably in a pandera v1 when backward incompatible changes can be made).

Happy to discuss reasons to keep this around/propagating it to other backends like the polars one.

cosmicBboy commented 4 weeks ago

help type checkers understand accessing string column names from the class attribute names.

can you expand on this?

baldwinj30 commented 4 weeks ago

Definitely! I really like being able to do something like this in pandas.

import pandera as pa
from pandera.typing import Series

class Model(pa.DataFrameModel):
    col_a: Series[int]

df.assign(**{Model.col_a: 1})

and from what I have seen and my understanding, if col_a was just marked as an int instead of Series[int], mypy will think that Model.col_a is an int and not a string and raise an error here. I have seen a similar result when trying to use the Polars API, though of course that code looks a little different.

That is really my only reason for wanting this; it is definitely not essential. I think it is very nice though to be able to see exactly what schema my column names go along with right where they are assigned, and not have to jump through extra steps to satisfy the type checker to do it.

baldwinj30 commented 4 weeks ago

I would add if there is a way to get that behavior without adding the Series tag I would prefer that as well, because I definitely agree that otherwise I would prefer not to use it.

cosmicBboy commented 4 weeks ago

Gotcha, this makes sense.

mypy will think that Model.col_a is an int and not a string and raise an error here

Just for the record, what's the error? Just tried the code locally and mypy doesn't raise any errors for me (might be my local mypy configuration).

I know it's useful, but supporting Model.col_a to return a string column name has bothered me for some time now because it breaks type-checkers, deviates from a Python user's expectation around attribute access semantics, and doesn't suport cases in which the column name is not a string (int, float, etc).

Is there potentially an alternative syntax that can be more flexible and forgiving on type checkers?

Model.get("col_a")  # where `get_name` returns Any type

Where get can resolve to either the attribute name or alias

class Model(pa.DataFrameModel):
    col_a: int
    col_2000: float = pa.Field(alias=2000)

Model.get("col_a")  # returns "col_a"
Model.get("col_2000")  # returns 2000

It's more verbose than Model.<attribute_name>, but its benefits are:

Not suggesting this as an immediate fix (probably a pandera v1 thing when breaking changes happen) but I don't want to propagate potential anti-patterns to new areas of pandera's API surface.

edit: to be clear, this is a separate issue from supporting Series type annotations, but if the only reason we want to add support for this is to use the Model.col_a syntax, I think it adds complexity/ambiguity/unneeded choices to the API surface just to enable the attribute access -> str functionality.

cosmicBboy commented 4 weeks ago

We could also solve this problem through documentation/better warning messages

Basically acknowledge the utility of Series[TYPE] for folks who really love to explicitly type everything. For lazier folks, they'd just need to # type: ignore lines where they use Model.col_a where col_a: int is the type annotation, or refactor their code to use col_a: Series[int]

baldwinj30 commented 4 weeks ago

For the record, the error that I get in the example case (with int instead of Series[int]) is: error: Keywords must be strings [misc]. My full versions are python3.11, mypy1.8.0, pandera 0.19.0b3, pandas 2.2.0, and pandas-stubs 2.1.4.231227; I think with mypy default settings. Of course that is just one example stemming from the type checker thinking that Model.<attribute_name> is an integer.

And yes that all makes sense to me, totally understand that it goes against what a normal python user might expect. It went against what I expected when I was first thinking about how to use this tbh, but it was kind of a pleasant surprise to me. When I thought about it a little more, I actually thought it made sense that you would not expect to be able to get actual data values from a schema definition, so I could at least justify to myself that Model.<attribute_name> made sense to return a string, or at the very least an Any as you mentioned for the case of non-string column names.

Of those options, I would probably choose to keep Series around and mark all my columns with that since I think it is a little neater than type: ignore everywhere or using a getter. That said, I am guessing I am probably in the minority on this one; I think it is just my personal preference to avoid string literals as much as possible, but I don't think most dataframe users would much care about being able to do this. I would definitely understand if you decided to just get rid of theSeries type altogether in a future change.

cosmicBboy commented 4 weeks ago

Thanks for the feedback! It seems you're not alone: https://discord.com/channels/897120336003334214/897554078274555965/1225829033863024791

Let me think about this some more and do a little research.

Just throwing out some more options, would love to get your thoughts:

mypy plugin updates

Another possibility is to define a get_class_attribute_hook the pandera mypy plugin: https://mypy.readthedocs.io/en/stable/extending_mypy.html#current-list-of-plugin-hooks. This would update the types of field class attributes. The problem with this is additional maintenance and the fact that it wouldn't support other type linters like pyright or pylance.

Model.get.col_name

Create get accessor property that's effectively the same as Model.get("col_name"), just syntactic sugar to avoid string literals. The problem with this is that you won't get editor autocompletion on the attribute names.

cosmicBboy commented 3 weeks ago

Okay, having had a few days to mull this over, here's the proposal:

  1. Let's go ahead and get https://github.com/unionai-oss/pandera/pull/1595 merged. I think it's more important to have a consistent API across validation backends, acknowledging what I'd consider the mistake of supporting Model.col_a to access the string column name... hard to take it back now esp. since it's quite useful and convenient.
  2. Will kick the can down the road in the decision of deprecating / removing the current Model.col_a behavior. For now, I think people who want to use it and want to have consistent mypy typing will just have to use the Series[TYPE] annotation... for many folks like you, it probably doesn't matter to have a more verbose DataFrameModel definitions.