[Open] benlindsay opened this issue 2 years ago
Hi @benlindsay. I agree repeating `nullable` can be verbose and cumbersome.

Pandera strives to keep feature parity between `SchemaModel` and `DataFrameSchema`, so we would need to introduce a similar option to `DataFrameSchema`. This is similar to the `coerce` option that can be set at the schema level with both the `SchemaModel` and `DataFrameSchema` APIs. We should also have similar defaults for `required`, and possibly `unique` and `allow_duplicates`. The default `pandera.Check(ignore_na=True)` is also often asked about. Can you think of other candidates?
An alternative would be a global config similar to the pandas options. It allows overriding defaults globally, or locally via a context manager:

```python
# global, applies to all schemas
pandera.options.nullable = True

# local
with pandera.options("nullable", True):

    class MySchema(pa.SchemaModel):
        nullable_col_1: Series[float]
        col_2: Series[float] = pa.Field(nullable=False)  # local override
```
One downside of a global config is that you would need to read both the schema definition and config to have the complete picture. The config could be stored in a separate python module, which could lead to surprising results for someone who only read the schema. I personally prefer the schema to be self-contained, without relying on side effects. On the upside, this mechanism avoids the proliferation of schema arguments.
An imperfect solution to reduce verbosity that you can apply right away:

```python
from typing import Any

import pandera as pa
from pandera.typing import Series


# SchemaModel
def Nullable(*args: Any, **kwargs: Any) -> pa.Field:
    """Field with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Field(*args, **kwargs)


class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = Nullable()
    col_2: Series[float]


# DataFrameSchema
def NullableCol(dtype: Any, *args: Any, **kwargs: Any) -> pa.Column:
    """Column with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Column(dtype, *args, **kwargs)


schema = pa.DataFrameSchema(
    {"nullable_col_1": NullableCol(float), "col_2": pa.Column(float)}
)
```
@jeffzi This one is a lifesaver, thanks a lot!
If you're into `functools`, you can also do something like:

```python
from functools import partial

import pandera as pa

# a Field factory with nullable=True baked in
NullableField = partial(pa.Field, nullable=True)
```
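A useful property of `functools.partial` here is that keyword arguments supplied at the call site override the pre-bound ones, so a per-field `nullable=False` still wins. A minimal sketch with a stand-in `field` function (for illustration only, not pandera's real `Field`):

```python
from functools import partial


def field(**kwargs):
    """Stand-in for pa.Field that just echoes its kwargs."""
    return kwargs


nullable_field = partial(field, nullable=True)

nullable_field()                # {'nullable': True}
nullable_field(nullable=False)  # {'nullable': False}, call-site kwarg wins
nullable_field(unique=True)     # pre-bound kwarg merged with new ones
```

This means the `NullableField` shortcut changes only the default, not what you can express.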
Would welcome a PR to add this option at the dataframe-schema level!

To open up a discussion about the semantics of options at the dataframe- and field- (column and index) level: the existing `coerce` option has the current behavior that `coerce=True` will override any field with `coerce=False`. This seems unintuitive to me, i.e. it seems like `DataFrameSchema(..., coerce=True)` should define the default global setting and any options at more granular levels should override the global setting. Similarly, `DataFrameSchema(..., nullable=True)` should define the global setting and `Column(..., nullable=False)` should override that.

What do y'all think @benlindsay @jeffzi @vovavili ?
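The proposed precedence rule can be stated in plain Python: a column-level value wins only when it was set, otherwise the schema-level default applies. A hypothetical sketch of the resolution logic, not pandera's actual implementation:

```python
# Hypothetical resolution for the proposed semantics, where None means
# "the user did not set the option at the column level".
def resolve_option(schema_level: bool, column_level=None) -> bool:
    if column_level is not None:
        return column_level  # explicit column setting overrides the schema
    return schema_level      # otherwise fall back to the schema-level default


resolve_option(schema_level=True)                      # schema default applies
resolve_option(schema_level=True, column_level=False)  # column override wins
```

Under the *current* `coerce` behavior described above, the schema-level `True` would win in both cases, which is the asymmetry being discussed.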
@cosmicBboy That sounds like a dream option for me! Would save me so much time. Full support.
> it seems like `DataFrameSchema(..., coerce=True)` should define the default global setting and any options at more granular levels should override the global setting.

Agreed. To be accurate, it should be "any explicitly set options at more granular levels should override the global setting". `coerce=False` is the default field option; if omitted, the global `coerce=True` should indeed override it. Technically, we'd need a sentinel value to differentiate unset arguments from defaults. See the unapproved PEP 661 - Sentinel Values.
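The sentinel pattern mentioned above can be sketched as follows; the `_UNSET` object and the simplified `Column` class are hypothetical, not pandera internals. The point is that an explicit `nullable=False` is distinguishable from "nothing was passed":

```python
# Sentinel distinguishing "user passed nothing" from "user explicitly
# passed the default value False" (cf. the unapproved PEP 661).
_UNSET = object()


class Column:
    """Simplified stand-in illustrating the sentinel idea."""

    def __init__(self, nullable=_UNSET):
        self._nullable = nullable

    def effective_nullable(self, schema_default: bool) -> bool:
        if self._nullable is _UNSET:
            return schema_default  # nothing set: inherit the schema default
        return self._nullable      # any explicit value, even False, wins


Column().effective_nullable(schema_default=True)               # inherits True
Column(nullable=False).effective_nullable(schema_default=True) # stays False
```

With a plain `False` default instead of a sentinel, the second case would be indistinguishable from the first.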
I ran into the need for this today as well. My output schemas are all coerce on input and nullable on all outputs across an API surface; I was going to make a base class for the output models but I couldn't do the output bit.
Cool. @blais the pandera internals rewrite is pretty much done (just have to clean up a few more things), but after that this feature should be fairly easy to support.
I might consider contributing to this one.
Could you briefly mention what files and classes should be modified or created? Not sure where to start. @cosmicBboy
Thank you,
Thanks @aidiss!

So I think a reasonable approach here is to support a dataframe-level default that can be overridden at the schema-component (column or index) level.

Here are the changes that need to be made:

1. Add a `nullable` argument to `DataFrameSchema.__init__`, which should be stored as a `self.nullable` instance attribute. This should be `None` by default.
2. Change the `nullable` default in `{ArraySchema, SeriesSchema, Column, Index}.__init__` to `None` so that we can get the correct behavior in the point below. We will also need to turn `self.nullable` into a private variable `self._nullable`, and expose `nullable` as a `@property` that returns `False` if `self._nullable` is `None`. This is so that we can distinguish between default behavior and user-provided values, as explained in the point below.
3. The `DataFrameSchemaBackend.collect_schema_components` method needs to be updated so that the df-level value is set on the `Column._nullable` property of the copied `col` object. However, if `Column._nullable` is not `None`, i.e. the user provided a value, then the df-level nullable value shouldn't be applied to the column. This works out nicely because the `Column.nullable` `@property` method will default to `False` if `Column._nullable` is `None`. (btw, do we want to support nullability in the Index? In that case we'll need to propagate that logic to apply to the index schema component as well.)
4. A `nullable` attribute needs to be added to the `BaseConfig` class for the class-based API.
5. Update the `kwargs` here, to include the new option.
6. Add tests for `DataFrameSchema` here and for `DataFrameModel` here.

Be sure to check out the contributing guide before you get started, and let me know if you have any questions! I'll be OOO for the next two weeks but can answer any questions you have after I get back from vacation.
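Points 2 and 3 of the plan above could look roughly like this. This is a simplified sketch of the proposal with stand-in classes, not pandera's actual source:

```python
import copy


class Column:
    """Simplified stand-in for pandera's Column, per the proposal."""

    def __init__(self, nullable=None):
        self._nullable = nullable  # None means "not set by the user"

    @property
    def nullable(self) -> bool:
        # preserve the current default (False) when nothing was set
        return False if self._nullable is None else self._nullable


def collect_schema_components(df_nullable, columns):
    """Propagate the df-level default onto columns that didn't set their own."""
    collected = []
    for col in columns:
        col = copy.copy(col)  # don't mutate the user's column objects
        if col._nullable is None and df_nullable is not None:
            col._nullable = df_nullable
        collected.append(col)
    return collected


cols = collect_schema_components(True, [Column(), Column(nullable=False)])
cols[0].nullable  # inherited the df-level default (True)
cols[1].nullable  # explicit user value preserved (False)
```

When `df_nullable` is also `None` (the user set nothing anywhere), the `@property` falls back to `False`, so existing schemas keep their current behavior.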
Hey @cosmicBboy @aidiss, any update on this? Happy to pick this up if it's still open, been hoping for this to become a thing for a while and keep coming back to this issue. :)
My specific use case is something like this: I have a situation where almost all of the columns in my schemas are nullable, and it would be nice to set `nullable = True` as a config option instead of setting `nullable=True` for every column. For example, instead of this:

```python
import pandera as pa
from pandera.typing import Series


class Hamburger(pa.DataFrameModel):
    patty: Series[int]
    cheese: Series[bool] = pa.Field(nullable=True)
    tomato: Series[int]
    salat: Series[str]
```

I'd love to be able to do this or something like it:

```python
class VeganHamburger(pa.DataFrameModel):
    class Config:
        nullable = True
```