unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars 305 forks

missing from_records method which returns DataFrame[Schema] #850

Closed borissmidt closed 1 year ago

borissmidt commented 2 years ago

Is your feature request related to a problem? Please describe. When I call pandera.typing.DataFrame[T].from_records() I get an untyped DataFrame back, not a pandera DataFrame[T].

Describe the solution you'd like A from_records method that returns a typed dataframe; this is especially useful for writing unit tests.

borissmidt commented 2 years ago

Another improvement might be a typed 'record' constructor that matches the columns, but I'm not sure how to make the IDE pick this up.
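One stdlib-only way to sketch such a typed record constructor (an illustration, not a pandera feature) is a TypedDict whose keys mirror the schema's columns, so the IDE can check field names and types at the call site. The field names here are hypothetical:

```python
from typing import TypedDict


# hypothetical record type mirroring a schema with an "idx" index
# and a "col1" column; the names are illustrative only
class SchemaRecord(TypedDict):
    idx: int
    col1: int


def record(idx: int, col1: int) -> SchemaRecord:
    """Typed constructor: IDEs flag wrong names or types at the call site."""
    return {"idx": idx, "col1": col1}
```

A list of such records can then be fed straight into from_records.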

cosmicBboy commented 2 years ago

A from_record method which allows you to create a typed dataframe this is especially usefull for writing unit tests.

I'm open to supporting this!

Basically the pandera DataFrame type would need to override the from_records method by calling super().from_records and then typing.cast-ing the output to pandera.typing.DataFrame[T] (a self-referencing return annotation).

Let me know if you have the capacity to make a PR for this! would be happy to help guide further.

In the meantime, a workaround would be something like:

pa.typing.DataFrame[Schema](pd.DataFrame.from_records(...))
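A minimal sketch of the override described above, using only pandas and the stdlib (the class and method bodies here are illustrative, not pandera's actual implementation):

```python
import typing

import pandas as pd

T = typing.TypeVar("T")


class TypedDataFrame(pd.DataFrame, typing.Generic[T]):
    """Illustrative DataFrame subclass with a self-referencing from_records."""

    @classmethod
    def from_records(cls, *args, **kwargs) -> "TypedDataFrame[T]":
        # delegate to pandas, then cast so type checkers see the typed frame;
        # typing.cast is a no-op at runtime
        df = super().from_records(*args, **kwargs)
        return typing.cast("TypedDataFrame[T]", df)
```

The cast changes only what static type checkers see; runtime validation would still have to be added separately.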
borissmidt commented 2 years ago

I made a workaround, but it isn't ideal, since schemas by default don't keep the names of the indexes; otherwise every from_records() call has to specify which indexes are used for that data type. I also couldn't get at the 'generic' parameter of the schema inside the DataFrame type. (I'm kind of new to Python, but programmed Scala before, where you can actually access these things.)

import typing

import pandas as pd
import pandera as pa
import pandera.typing as pat

TSchemaModel = typing.TypeVar("TSchemaModel", bound=pa.SchemaModel)


class ExtDataFrame(typing.Generic[TSchemaModel]):
    """Wraps a schema so from_records validates and keeps the index names."""

    def __init__(self, t: typing.Type[TSchemaModel]):
        self.schema = t

    def from_records(
        self,
        data,
        index=None,
        exclude=None,
        columns=None,
        coerce_float: bool = False,
        nrows: typing.Optional[int] = None,
    ) -> pat.DataFrame[TSchemaModel]:
        schema = self.schema.to_schema()
        # default to the index names declared on the schema
        index = schema.index.names if index is None else index
        return self.schema.validate(
            pd.DataFrame.from_records(
                data=data,
                index=index,
                exclude=exclude,
                columns=columns,
                coerce_float=coerce_float,
                nrows=nrows,
            )
        )

# then you can do:
ExtDataFrame(Schema).from_records([{"col1": 1, "idx": 2}])
# in most of my code I do:
ExtDataFrame(Schema).from_records([Schema.record(col1=1, idx=2)])

OffTopic:

I just noticed that pandera doesn't let the IDE (PyCharm) easily refactor column names, since the IDE doesn't understand that the columns in pandera.typing.DataFrame are defined by a generic type.

Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns. I have to explicitly call Schema.validate, and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.

cosmicBboy commented 2 years ago

I'm not quite clear on your use case here... would you mind elaborating on that and why you need strictly typed dataframes? Example tests/code that you're using would help. I ask mainly because this part of the pandera functionality is still experimental, and people who need strictly typed dataframes might see some rough edges, as you have 🙂

I made a workaround, but it isn't ideal, since schemas by default don't keep the names of the indexes; otherwise every from_records() call has to specify which indexes are used for that data type.

You can supply check_name to the Field associated with your index: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field

I just noticed that pandera doesn't let the IDE (PyCharm) easily refactor column names, since the IDE doesn't understand that the columns in pandera.typing.DataFrame are defined by a generic type.

This is a known limitation of pandera... we haven't yet explored ways of modifying the pandera.typing.DataFrame class to make it aware of the columns/indexes defined in the schema, as this adds more complexity to the pandera DataFrame subclass, e.g. we'd need to deal with conflicts between pandas.DataFrame methods/attributes and user-defined column names. I understand this isn't ideal from an IDE autocompletion perspective, but why does this make refactoring column names hard?

Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns. I have to explicitly call Schema.validate, and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.

Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen

borissmidt commented 2 years ago

Yes, I like typed dataframes because they are really good for documenting code, so you don't make errors in column names or types; they also catch a lot of problems in case of missing data.

Another use I made of it is as a definition of my xlsx report output. I use the title in the Field to set the column title in the xlsx output, and use reflection on the schema to get the right columns for serialization. This makes it very easy to change the order of columns in the output format, and to change the output itself. In case of a missing column, the code fails at the function that calculates the data, instead of me having to manually check the output file.

for example:

# Just extends the SchemaModel
class MonthlySummary(SchemaModelXlsx):
    @classmethod
    @property
    def sheet_name(cls) -> str: #this could be title in the config instead
        return "Summary"

    month: pat.Index[pat.DateTime] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")

def revenue(sales: pat.DataFrame[ProductSales], services: pat.DataFrame[ServiceSales]) -> pat.DataFrame[Revenue]:
    pass

def monthly_summary(bruto_revenue: pat.DataFrame[Revenue], expenses_per_day: pat.DataFrame[Expenses]) -> pat.DataFrame[MonthlySummary]:
    # reindexes and then takes the difference between the different
    # types of revenue and expenses.
    return pat.DataFrame[MonthlySummary](
        {
            "bruto_revenue": total_revenue,
            "expenses": total_expenses,
            "netto_revenue": total_revenue - total_expenses,
        }
    )

Ideally a typed dataframe should have a constructor, so you could call:

# each field should be typed to make construction easy
MonthlySummary(month, bruto_revenue, expenses, netto_revenue)

Or, if you want to extract it from an existing df (this could drop the 'unstated' columns, to keep you from just adding some data):

MonthlySummary.from_df(df)

Having these specialized types could also open the opportunity to attach methods and properties to the dataframes, making it easy to calculate aggregated data with the defined types.
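A pandas-only sketch of the hypothetical from_df suggested above, which keeps only the declared columns (the column list is hard-coded here for illustration; a real version would read it off the schema):

```python
import pandas as pd

# columns the MonthlySummary schema declares (hard-coded for this sketch)
MONTHLY_SUMMARY_COLUMNS = ["bruto_revenue", "expenses", "netto_revenue"]


def monthly_summary_from_df(df: pd.DataFrame) -> pd.DataFrame:
    # select only the declared columns, dropping any 'unstated' extras;
    # a missing declared column raises a KeyError here
    return df[MONTHLY_SUMMARY_COLUMNS].copy()
```

Selecting by an explicit column list both drops extras and fails loudly when a declared column is absent, which is the behavior asked for above.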

Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen

Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.

borissmidt commented 2 years ago

Okay, it's only from_records that doesn't do any checks. But I only use it in my unit tests.

class MonthlySummary2(SchemaModel):
    month: pat.Index[int] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")

df = pat.DataFrame[MonthlySummary2].from_records(
    [
        {
            "month": 1,
            "bruto_revenue": 1.0,
            "expenses": 2.0,
        }
    ],
    index=["month"],
)
# passes without complaint, even though "netto_revenue" is missing
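To make the gap concrete, here is a plain-pandas check of what validation would have to catch (pandera's Schema.validate raises a SchemaError for the same reason; this sketch avoids pandera so it runs standalone):

```python
import pandas as pd

# columns the schema above declares
required = {"month", "bruto_revenue", "expenses", "netto_revenue"}

df = pd.DataFrame.from_records(
    [{"month": 1, "bruto_revenue": 1.0, "expenses": 2.0}],
    index=["month"],
)
# the index consumes "month", so compare against columns plus index names
present = set(df.columns) | set(df.index.names)
missing = required - present
# missing == {"netto_revenue"}, yet from_records itself raised nothing
```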
cosmicBboy commented 2 years ago

Hi @borissmidt

Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.

Yeah, if you can send a code snippet (or maybe a PR 🙂) to update that test, that would be great!

Okay, it's only from_records that doesn't do any checks. But I only use it in my unit tests.

I'm down to support this use case, but I'm currently working on other stuff (#381), so if you'd like to own that part of the codebase I can help review changes and get them merged into the core library.

borissmidt commented 2 years ago

I will spend some time to make a PR.


cosmicBboy commented 1 year ago

Fixed by #859.