Closed borissmidt closed 1 year ago
Another improvement might be to have a typed 'record' constructor which matches the columns, but I'm not sure how you can make the IDE pick this up. A from_records method which allows you to create a typed dataframe would be especially useful for writing unit tests.
I'm open to supporting this! Basically the pandera DataFrame type would need to override the from_records method by calling the super().from_records method and then typing.cast the output of that to pandera.typing.DataFrame[T] (a self-referencing return annotation).

Let me know if you have the capacity to make a PR for this! Would be happy to help guide further.

In the meantime, a workaround would be something like:

```python
pa.typing.DataFrame[Schema](pd.DataFrame.from_records(...))
```
I made a workaround, but it isn't ideal, since the schemas by default don't keep the names of the indexes; otherwise, each time you do a from_records() call you have to specify which indexes are used for that data type. I also couldn't get to the 'generic' parameter of the schema inside the DataFrame type. (I'm kind of new to Python, but programmed Scala before, where you can actually access these things.)
```python
import typing

import pandera as pa
import pandera.typing as pat

TSchemaModel = typing.TypeVar("TSchemaModel", bound=pa.SchemaModel)


class ExtDataFrame(pat.DataFrame[TSchemaModel]):
    schema: typing.Type[TSchemaModel]

    def __init__(self, t: typing.Type[TSchemaModel]):
        self.schema = t

    def from_records(  # type: ignore
        self,
        data,
        index=None,
        exclude=None,
        columns=None,
        coerce_float: bool = False,
        nrows: typing.Optional[int] = None,
    ) -> pat.DataFrame[TSchemaModel]:
        schema = self.schema.to_schema()
        # fall back to the index names declared in the schema
        index = schema.index.names if index is None else index
        return self.schema.validate(
            pat.DataFrame[TSchemaModel](
                pat.DataFrame[TSchemaModel].from_records(
                    data=data,
                    index=index,
                    exclude=exclude,
                    columns=columns,
                    coerce_float=coerce_float,
                    nrows=nrows,
                )
            )
        )
```

Then you can do:

```python
ExtDataFrame(Schema).from_records([{"col1": 1, "idx": 2}])
```

In most of my code I do:

```python
ExtDataFrame(Schema).from_records([Schema.record(col1=1, idx=2)])
```
Off-topic: the IDE doesn't understand that the columns in pandas.typing.DataFrame are defined by a generic type. Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns; I have to explicitly call Schema.validate and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.
I'm not quite clear on your use case here... would you mind elaborating on that and why you need strictly typed dataframes? Example tests/code that you're using would help. I ask mainly because this part of the pandera functionality is still experimental, and people who need strictly typed dataframes might see some rough edges, as you have 🙂
> I made a workaround but it isn't ideal, since the schemas by default don't keep the names of the indexes. Otherwise each time you do a from_records() call you have to specify which indexes are used in that data type.
You can supply `check_name` to the `Field` associated with your index: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field
> I just noticed that pandera doesn't allow the IDE (PyCharm) to easily refactor column names, since the IDE doesn't understand that the columns in pandas.typing.DataFrame are defined by a generic type.
This is a known limitation of pandera... we haven't yet explored ways of modifying the `pandera.typing.DataFrame` class to make it aware of the columns/indexes defined in the schema, as this adds more complexity to the pandera DataFrame subclass; e.g. we'd need to deal with conflicts between pandas.DataFrame methods/attributes and user-defined column names. I understand this isn't ideal from an IDE autocompletion perspective, but why does this make refactoring column names hard?
> also pa.typing.DataFrame[Schema] doesn't seem to validate the columns. i have to explicitly call Schema.validate and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame
Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen
Yes, I like typed dataframes because they are really good for documenting the code, so you don't make any errors in column names or types. They also catch a lot of problems in case of missing data.
Another use I made of it is as a definition of my xlsx report output. I use the title in the Field to set the column title in the xlsx output, and use reflection on the schema to get the right columns for serialization. This makes it very easy to change the order of columns in the output format and to change the output itself. In case of a missing column, the code fails at the function that calculates the data instead of me having to manually check the output file.
for example:
```python
# Just extends the SchemaModel
class MonthlySummary(SchemaModelXlsx):
    @classmethod
    @property
    def sheet_name(cls) -> str:  # this could be a title in the config instead
        return "Summary"

    month: pat.Index[pat.DateTime] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")


def revenue(
    sales: pat.DataFrame[ProductSales], services: pat.DataFrame[ServiceSales]
) -> pat.DataFrame[Revenue]:
    pass


def monthly_summary(
    bruto_revenue: pat.DataFrame[Revenue], expenses_per_day: pat.DataFrame[Expenses]
):
    # reindexes, then takes the difference between revenue and expenses
    return pat.DataFrame[MonthlySummary](
        {
            "bruto_revenue": total_revenue,
            "expenses": total_expenses,
            "netto_revenue": total_revenue - total_expenses,
        }
    )
```
Ideally a typed dataframe API should have a constructor, so you could call:

```python
# each field should be typed to make construction easy
MonthlySummary(month, bruto_revenue, expenses, netto_revenue)
```

or, if you want to extract it from a df, something that drops the 'unstated' columns to enforce that you don't just add some data:

```python
MonthlySummary.from_df(df)
```
Having these specialized types could also open the opportunity to add methods and properties to the dataframes, to make calculating aggregated data with the defined types easy.
> Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen
Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.
OK, it is only from_records that doesn't do any checks. But I only use it in my unit tests.
```python
import pandera as pa
import pandera.typing as pat
from pandera import SchemaModel


class MonthlySummary2(SchemaModel):
    month: pat.Index[int] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")


df = pat.DataFrame[MonthlySummary2].from_records(
    [
        {
            "month": 1,
            "bruto_revenue": 1.0,
            "expenses": 2.0,
        }
    ],
    index=["month"],
)
```
Hi @borissmidt
> Looking at the test it doesn't check for missing columns i'll try to spend some time today to make an sample to double check the problem.
Yeah if you can send a code snippet (or maybe a PR 🙂) to update that test that would be great!
> Oke it is only the from_records that doesn't do any checks. But i only use it in my unit tests.
I'm down to support this use case, but I'm currently working on other stuff (#381) so if you'd like to own that part of the codebase I can help review changes and get them merged into the core library.
I will spend some time to make a PR.
fixed by #859
**Is your feature request related to a problem? Please describe.**
When I do pandera.typing.DataFrame[T].from_records() I get an untyped DataFrame back, not a pandera DataFrame[T].

**Describe the solution you'd like**
A from_records method which allows you to create a typed dataframe; this is especially useful for writing unit tests.