Davidkloving opened this issue 1 year ago
You can do SchemaModel.example(size=0)
to create an empty dataframe, via data synthesis strategies
Thanks Niels for taking the time to respond.
I have tried .example(size=0), but I was hoping to accomplish this without introducing hypothesis as a dependency. For some reason it makes our tests very slow to start, sometimes causes them to hang, and, surprisingly, produces the following error:
pandera.errors.SchemaError: expected series 'created_at' to have type datetime64[ns, UTC], got datetime64[ns]
for a column defined as such:
```python
created_at: Optional[Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]]  # type: ignore
```
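For context on that error (a pandas-only aside, not part of the original thread): pandas treats timezone-aware and timezone-naive datetimes as distinct dtypes, which is exactly what the SchemaError above is complaining about. A minimal illustration:

```python
import pandas as pd

# A naive datetime series: dtype is datetime64[ns]
naive = pd.Series(pd.to_datetime(["2023-01-01"]))
print(naive.dtype)  # datetime64[ns]

# Localizing produces a different dtype entirely: datetime64[ns, UTC]
aware = naive.dt.tz_localize("UTC")
print(aware.dtype)  # datetime64[ns, UTC]

# The two dtypes compare unequal, so a schema expecting
# datetime64[ns, UTC] rejects a plain datetime64[ns] column
print(naive.dtype == aware.dtype)  # False
```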
Ah, I don't think pandera strategies support pd.DatetimeTZDtype yet. Do you mind opening up a feature request?
Here's a recipe for creating an empty dataframe without the data synthesis strategies:
```python
from typing import Annotated

import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    col1: Series[int]
    col2: Series[float]
    col3: Series[str]
    col4: Series[pd.Timestamp]
    col5: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]


dtypes = {k: str(v) for k, v in Schema.to_schema().dtypes.items()}
empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)
print(empty_df)
print(empty_df.dtypes)

# Output:
# Empty DataFrame
# Columns: [col1, col2, col3, col4, col5]
# Index: []
# col1                  int64
# col2                float64
# col3                 object
# col4         datetime64[ns]
# col5    datetime64[ns, UTC]
# dtype: object
```
Does this work for you?
Thanks for the suggestion! Yes, this does work. I was able to combine it with a trick I learned from PEP 673 to come up with the following solution which works for Python 3.10 and plays nicely with mypy:
```python
from typing import Type, TypeVar

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame

SchemaType = TypeVar("SchemaType", bound="MySchemaModel")


class MySchemaModel(pa.SchemaModel):
    """
    Provides a `pandera.SchemaModel` with a convenience function for generating
    empty dataframes that fit the schema.
    """

    @classmethod
    def empty(cls: Type[SchemaType]) -> DataFrame[SchemaType]:
        dtypes = {k: str(v) for k, v in cls.to_schema().dtypes.items()}
        empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)
        return DataFrame[SchemaType](empty_df)
```
Is this something that we could add to Pandera itself? I'm sure an .empty() method would be useful to many people.
Great! Yes, would welcome a PR on this. One note on the approach: we should add an empty method to DataFrameSchema, which implements basically the first two lines of your empty method; SchemaModel.empty would then use it to create the typed DataFrame[SchemaType] dataframe. Converting this issue to an enhancement.
Hi, I like the feature and got impatient, so I started working on a PR. A few questions, @cosmicBboy: while writing tests I ran into some issues. These types can't be instantiated as a dtype during the astype call, at least some of them due to being too abstract. But can all of these be safely dropped from the test?
pandera.dtypes.DataType
pandera.dtypes._Number
pandera.dtypes._PhysicalNumber
pandera.engines.numpy_engine.DataType
pandera.engines.pandas_engine.DataType
pandera.engines.pandas_engine.Period
pandera.engines.pandas_engine.Interval
pandera.engines.pandas_engine.PydanticModel
And these two failed a subsequent validate call by the schema that defined the dtype:
pandera.engines.numpy_engine.DateTime64 -- Expected type datetime64, got datetime64[ns]
pandera.engines.numpy_engine.Bytes -- Data type 'bytes8' not understood
I'll look into them a bit more, but I'm hoping you could tell me right away what the issue might be.
edit: I built my test-schema like this:
```python
schema = pandera.DataFrameSchema(columns={
    "pandera.dtypes.DataType": pandera.Column(pandera.dtypes.DataType),
    "pandera.dtypes._Number": pandera.Column(pandera.dtypes._Number),
    "pandera.dtypes._PhysicalNumber": pandera.Column(pandera.dtypes._PhysicalNumber),
    "pandera.dtypes.Int": pandera.Column(pandera.dtypes.Int),
    "pandera.dtypes.Int64": pandera.Column(pandera.dtypes.Int64),
    # ... etc
})
```
I hope that's how I'm supposed to do it.
Thanks @a-recknagel !
Not sure what implementation you opted for, but revisiting the code snippet I posted above, I think a more robust approach would be:
```python
schema = Schema.to_schema()
schema.coerce = True
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
```
This piggybacks on pandera's coercion logic, and you should be able to use the pandera DataType subclasses in your test.
But can all of these be safely dropped from the test?
I'd ignore the abstract DataTypes (basically, test the dtypes supported by pandas_engine); pandas-supported types like Period and Interval should be included.
PydanticModel is an interesting case. I'm not sure, but I don't think it'll work, given that it needs rows to make coercion work. I'd recommend special-casing a TypeError when using empty() with PydanticModel for now.
I created a PR from my in-progress branch; changing the way the empty dataframe is created to leverage coercion didn't seem to change the failing cases.
> PydanticModel is an interesting case, I'm not sure, but I don't think it'll work given that it needs rows to make coercion work. I'd recommend special-casing a TypeError when using empty() with the PydanticModel for now.
You mean within the empty function, right? I'll try tomorrow.
I know this is an old thread, but I came across it and it mostly worked for me, except it wasn't preserving an index field. I added a line index=pd.Index([], name=schema.index.name, dtype=schema.index.dtype.type) to get it to work:
```python
class TestDf(pa.DataFrameModel):
    dt: Index[datetime] = pa.Field(check_name=True)
    col1: Series[int]
    col2: Series[int]
    col3: Series[int]

    @classmethod
    def empty(cls):
        schema = cls.to_schema()
        schema.coerce = True
        index = pd.Index([], name=schema.index.name, dtype=schema.index.dtype.type)
        empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns], index=index))
        return empty_df
```
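The reason the extra line is needed (a pandas-only aside, not from the thread): a DataFrame constructed with only columns= gets a default empty index, so any index name and dtype from the schema are lost unless the index is built explicitly:

```python
import pandas as pd

# An empty frame built with only `columns=` gets a default index,
# losing any index name/dtype the schema defines
df_default = pd.DataFrame(columns=["col1"])
print(df_default.index.name)  # None

# Building the empty index explicitly preserves both its name and dtype
index = pd.Index([], name="dt", dtype="datetime64[ns]")
df_indexed = pd.DataFrame(columns=["col1"], index=index)
print(df_indexed.index.name)   # dt
print(df_indexed.index.dtype)  # datetime64[ns]
```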
@ssuffian My time is bound up elsewhere so I can't review this right now. If you want, you can cherry pick the changes from my branch and take over though.
Question about pandera
I need to be able to create an empty dataframe and (maybe) populate it later. I hoped this would be fairly straightforward, like np.empty(...), but so far the best way I have found is to write an empty() method for each SchemaModel I have that explicitly creates a pd.DataFrame with manually-maintained columns and dtypes. Have I overlooked something?