Closed baldwinj30 closed 3 weeks ago
Working on this now
@baldwinj30 can we discuss why this is useful?
I actually see the Series
type used in a DataFrameModel
class definition as too verbose with not much added value, and am considering deprecating it in the pandas validation backend at some point (probably in a pandera v1 when backward incompatible changes can be made).
Happy to discuss reasons to keep this around/propagating it to other backends like the polars one.
help type checkers understand accessing string column names from the class attribute names.
can you expand on this?
Definitely! I really like being able to do something like this in pandas.
import pandera as pa
from pandera.typing import Series
class Model(pa.DataFrameModel):
col_a: Series[int]
df.assign(**{Model.col_a: 1})
and from what I have seen and my understanding, if col_a
was just marked as an int
instead of Series[int]
, mypy will think that Model.col_a
is an int and not a string and raise an error here. I have seen a similar result when trying to use the Polars API, though of course that code looks a little different.
That is really my only reason for wanting this; it is definitely not essential. I think it is very nice though to be able to see exactly what schema my column names go along with right where they are assigned, and not have to jump through extra steps to satisfy the type checker to do it.
I would add if there is a way to get that behavior without adding the Series
tag I would prefer that as well, because I definitely agree that otherwise I would prefer not to use it.
Gotcha, this makes sense.
mypy will think that Model.col_a is an int and not a string and raise an error here
Just for the record, what's the error? Just tried the code locally and mypy doesn't raise any errors for me (might be my local mypy configuration).
I know it's useful, but supporting Model.col_a
to return a string column name has bothered me for some time now because it breaks type-checkers, deviates from a Python user's expectation around attribute access semantics, and doesn't suport cases in which the column name is not a string (int, float, etc).
Is there potentially an alternative syntax that can be more flexible and forgiving on type checkers?
Model.get("col_a") # where `get_name` returns Any type
Where get
can resolve to either the attribute name or alias
class Model(pa.DataFrameModel):
col_a: int
col_2000: float = pa.Field(alias=2000)
Model.get("col_a") # returns "col_a"
Model.get("col_2000") # returns 2000
It's more verbose than Model.<attribute_name>
, but its benefits are:
get
, thus not breaking type checkers.Not suggesting this as an immediate fix (probably a pandera v1 thing when breaking changes happen) but I don't want to propagate potential anti-patterns to new areas of pandera's API surface.
edit: to be clear, this is a separate issue from supporting Series
type annotations, but if the only reason we want to add support for this is to use the Model.col_a
syntax, I think it adds complexity/ambiguity/unneeded choices to the API surface just to enable the attribute access -> str functionality.
We could also solve this problem through documentation/better warning messages
Basically acknowledge the utility of Series[TYPE]
for folks who really love to explicitly type everything. For lazier folks, they'd just need to # type: ignore
lines where they use Model.col_a
where col_a: int
is the type annotation, or refactor their code to use col_a: Series[int]
For the record, the error that I get in the example case (with int
instead of Series[int]
) is: error: Keywords must be strings [misc]
. My full versions are python3.11, mypy1.8.0, pandera 0.19.0b3, pandas 2.2.0, and pandas-stubs 2.1.4.231227; I think with mypy default settings. Of course that is just one example stemming from the type checker thinking that Model.<attribute_name>
is an integer.
And yes that all makes sense to me, totally understand that it goes against what a normal python user might expect. It went against what I expected when I was first thinking about how to use this tbh, but it was kind of a pleasant surprise to me. When I thought about it a little more, I actually thought it made sense that you would not expect to be able to get actual data values from a schema definition, so I could at least justify to myself that Model.<attribute_name>
made sense to return a string, or at the very least an Any
as you mentioned for the case of non-string column names.
Of those options, I would probably choose to keep Series
around and mark all my columns with that since I think it is a little neater than type: ignore
everywhere or using a getter. That said, I am guessing I am probably in the minority on this one; I think it is just my personal preference to avoid string literals as much as possible, but I don't think most dataframe users would much care about being able to do this. I would definitely understand if you decided to just get rid of theSeries
type altogether in a future change.
Thanks for the feedback! It seems you're not alone: https://discord.com/channels/897120336003334214/897554078274555965/1225829033863024791
Let me think about this some more and do a little research.
Just throwing out some more options, would love to get your thoughts:
Another possibility is to define a get_class_attribute_hook
the pandera mypy plugin: https://mypy.readthedocs.io/en/stable/extending_mypy.html#current-list-of-plugin-hooks. This would update the types of field class attributes. The problem with this is additional maintenance and the fact that it wouldn't support other type linters like pyright or pylance.
Model.get.col_name
Create get
accessor property that's effectively the same as Model.get("col_name")
, just syntactic sugar to avoid string literals. The problem with this is that you won't get editor autocompletion on the attribute names.
Okay, having had a few days to mull this over, here's the proposal:
Model.col_a
to access the string column name... hard to take it back now esp. since it's quite useful and convenient.Model.col_a
behavior. For now, I think people who want to use it and want to have consistent mypy typing will just have to use the Series[TYPE]
annotation... for many folks like you, it probably doesn't matter to have a more verbose DataFrameModel
definitions.
Is your feature request related to a problem? Please describe. This would provide a similar
pandera.typing.Series
type for the polars API that exists for pandas and other backends. It is not strictly necessary since it does not get checked, but it would be good to have to match the existing API and to help type checkers understand accessing string column names from the class attribute names.Describe the solution you'd like Allow for annotating polars data frame fields as: