unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars 305 forks

Create empty dataframe from schema #992

Open Davidkloving opened 1 year ago

Davidkloving commented 1 year ago

Question about pandera

I need to be able to create an empty dataframe and (maybe) populate it later. I hoped this would be fairly straightforward, like np.empty(...), but so far the best way I have found is to write an empty() method for each SchemaModel I have that explicitly creates a pd.DataFrame with manually maintained columns and dtypes. Have I overlooked something?

cosmicBboy commented 1 year ago

You can do SchemaModel.example(size=0) to create an empty dataframe, via data synthesis strategies

Davidkloving commented 1 year ago

Thanks, Niels, for taking the time to respond.

I have tried .example(size=0), but I was hoping to accomplish this without introducing hypothesis as a dependency. For some reason it makes our tests very slow to start, sometimes causes them to hang, and, surprisingly, produces the following error:

pandera.errors.SchemaError: expected series 'created_at' to have type datetime64[ns, UTC], got datetime64[ns]

for a column defined as such:

created_at: Optional[Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]]  # type: ignore
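
The mismatch itself is easy to reproduce with plain pandas; in this sketch, tz-naive and tz-aware timestamps are distinct dtypes, which is exactly what the SchemaError is complaining about:

```python
import pandas as pd

# tz-naive and tz-aware timestamps have different dtypes in pandas
naive = pd.Series(pd.to_datetime(["2023-01-01"]))
aware = naive.dt.tz_localize("UTC")

print(naive.dtype)  # datetime64[ns]
print(aware.dtype)  # datetime64[ns, UTC]
```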

cosmicBboy commented 1 year ago

ah, I don't think pandera strategies support pd.DatetimeTZDtype yet.

do you mind opening up a feature request?

Here's a recipe for creating an empty dataframe without the data synthesis strategies:

from typing import Annotated

import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    col1: Series[int]
    col2: Series[float]
    col3: Series[str]
    col4: Series[pd.Timestamp]
    col5: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]

# map each column name to its pandas dtype string, then build an
# empty frame with those columns and cast it to the right dtypes
dtypes = {k: str(v) for k, v in Schema.to_schema().dtypes.items()}
empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)

print(empty_df)
print(empty_df.dtypes)

# Output:
# Empty DataFrame
# Columns: [col1, col2, col3, col4, col5]
# Index: []
# col1                  int64
# col2                float64
# col3                 object
# col4         datetime64[ns]
# col5    datetime64[ns, UTC]
# dtype: object

Does this work for you?
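
The astype pattern in the recipe above is plain pandas and handles the tz-aware dtype directly; a minimal standalone sketch (the column names here are illustrative):

```python
import pandas as pd

# casting empty object columns to concrete dtypes works,
# including the tz-aware datetime dtype
dtypes = {"col1": "int64", "col5": "datetime64[ns, UTC]"}
empty_df = pd.DataFrame(columns=list(dtypes)).astype(dtypes)

print(empty_df.dtypes["col1"])  # int64
print(empty_df.dtypes["col5"])  # datetime64[ns, UTC]
```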

Davidkloving commented 1 year ago

Thanks for the suggestion! Yes, this does work. I was able to combine it with a trick I learned from PEP 673 to come up with the following solution which works for Python 3.10 and plays nicely with mypy:

from typing import TypeVar

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame

SchemaType = TypeVar("SchemaType", bound="MySchemaModel")

class MySchemaModel(pa.SchemaModel):
    """
    Provides a `pandera.SchemaModel` with a convenience function for generating
    empty dataframes that fit the schema.
    """

    @classmethod
    def empty(cls: type[SchemaType]) -> DataFrame[SchemaType]:
        # build an empty frame with the schema's columns and dtypes
        dtypes = {k: str(v) for k, v in cls.to_schema().dtypes.items()}
        empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)
        return DataFrame[SchemaType](empty_df)

Is this something that we could add to Pandera itself? I'm sure a .empty() would be useful to many people.
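
The bound-TypeVar trick is independent of pandera; a minimal sketch of the pattern (class names here are illustrative), which is what lets type checkers infer the precise subclass from a classmethod:

```python
from typing import TypeVar

T = TypeVar("T", bound="Base")

class Base:
    @classmethod
    def create(cls: type[T]) -> T:
        # cls is the concrete subclass at the call site, so each
        # subclass gets a precisely typed return value
        return cls()

class Child(Base):
    pass

obj = Child.create()
print(type(obj).__name__)  # Child
```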

cosmicBboy commented 1 year ago

Great! Yes, I would welcome a PR on this. One note on the approach: we should add an empty method to DataFrameSchema, implementing basically the first two lines of your empty method; SchemaModel.empty would then use it to create the typed DataFrame[SchemaType] dataframe. Converting this issue to an enhancement.

a-recknagel commented 1 year ago

Hi, I like the feature and got impatient, so I started working on a PR. A few questions, @cosmicBboy: while writing tests I ran into some issues.

These types can't be instantiated as a dtype during the astype call, at least some of them because they are too abstract. Can all of these be safely dropped from the test?

pandera.dtypes.DataType
pandera.dtypes._Number
pandera.dtypes._PhysicalNumber
pandera.engines.numpy_engine.DataType
pandera.engines.pandas_engine.DataType
pandera.engines.pandas_engine.Period
pandera.engines.pandas_engine.Interval
pandera.engines.pandas_engine.PydanticModel

And these two failed a subsequent validate call by the schema that defined the dtype:

I'll look into them a bit more, but I'm hoping you could tell me right away what the issue might be.


edit: I built my test-schema like this:

schema = pandera.DataFrameSchema(columns={
    "pandera.dtypes.DataType": pandera.Column(pandera.dtypes.DataType),
    "pandera.dtypes._Number": pandera.Column(pandera.dtypes._Number),
    "pandera.dtypes._PhysicalNumber": pandera.Column(pandera.dtypes._PhysicalNumber),
    "pandera.dtypes.Int": pandera.Column(pandera.dtypes.Int),
    "pandera.dtypes.Int64": pandera.Column(pandera.dtypes.Int64),
    ... etc
})

I hope that's how I'm supposed to do it.

cosmicBboy commented 1 year ago

Thanks @a-recknagel !

Not sure what implementation you opted for, but revisiting the code snippet I posted above, I think a more robust approach would be:

schema = Schema.to_schema()
schema.coerce = True
# reuse pandera's own coercion logic to cast the empty columns
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))

This piggy-backs on pandera's coercion logic, and you should be able to use the pandera DataType subclasses in your test.

But can all of these be safely dropped from the test?

I'd ignore the abstract DataTypes (basically, test the dtypes supported by pandas_engine). pandas-supported types like Period and Interval should be included.

PydanticModel is an interesting case, I'm not sure, but I don't think it'll work given that it needs rows to make coercion work. I'd recommend special-casing a TypeError when using empty() with the PydanticModel for now.
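
Such a guard might look something like the following sketch; the make_empty helper and the "pydantic-model" marker are hypothetical placeholders, not pandera API:

```python
import pandas as pd

# UNSUPPORTED is a hypothetical placeholder for dtypes (like
# PydanticModel) that need rows for coercion to work
UNSUPPORTED = {"pydantic-model"}

def make_empty(dtypes: dict[str, str]) -> pd.DataFrame:
    """Build an empty frame, rejecting dtypes that can't coerce zero rows."""
    bad = [name for name, dtype in dtypes.items() if dtype in UNSUPPORTED]
    if bad:
        raise TypeError(f"empty() does not support columns: {bad}")
    return pd.DataFrame(columns=list(dtypes)).astype(dtypes)

df = make_empty({"a": "int64"})
print(df.dtypes["a"])  # int64
```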

a-recknagel commented 1 year ago

I created a PR from my in-progress branch; changing the way the empty dataframe is created to leverage coercion didn't seem to change the failing cases.

PydanticModel is an interesting case, I'm not sure, but I don't think it'll work given that it needs rows to make coercion work. I'd recommend special-casing a TypeError when using empty() with the PydanticModel for now.

You mean within the empty function, right? I'll try tomorrow.

ssuffian commented 1 year ago

I know this is an old thread, but I came across it and it mostly worked for me, except it wasn't preserving an index field. I added a line index=pd.Index([], name=schema.index.name, dtype=schema.index.dtype.type) to get it to work:

from datetime import datetime

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class TestDf(pa.DataFrameModel):
    dt: Index[datetime] = pa.Field(check_name=True)
    col1: Series[int]
    col2: Series[int]
    col3: Series[int]

    @classmethod
    def empty(cls):
        schema = cls.to_schema()
        schema.coerce = True
        # carry the schema's index name and dtype into the empty frame
        index = pd.Index([], name=schema.index.name, dtype=schema.index.dtype.type)
        empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns], index=index))
        return empty_df
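
The key point is that an empty pd.Index still carries its name and dtype, which is what the extra line preserves; a pandas-only sketch (names are illustrative):

```python
import pandas as pd

# an empty index still preserves its name and dtype
index = pd.Index([], name="dt", dtype="datetime64[ns]")
df = pd.DataFrame(columns=["col1", "col2"], index=index)

print(df.index.name)   # dt
print(df.index.dtype)  # datetime64[ns]
```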

a-recknagel commented 1 year ago

@ssuffian My time is bound up elsewhere, so I can't review this right now. If you want, you can cherry-pick the changes from my branch and take over, though.