add "nullable" option to SchemaModel Config class

benlindsay commented 2 years ago

I have a situation where almost all of the columns in my schemas are nullable, and it would be nice to set nullable = True as a config option instead of setting nullable=True for every column. For example, instead of this:

import pandera as pa
from pandera.typing import Series, DataFrame

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = pa.Field(nullable=True)
    nullable_col_2: Series[float] = pa.Field(nullable=True)
    nullable_col_3: Series[float] = pa.Field(nullable=True)
    nullable_col_4: Series[float] = pa.Field(nullable=True)
    nullable_col_5: Series[float]
    nullable_col_6: Series[float] = pa.Field(nullable=True)

I'd love to be able to do this or something like it:

import pandera as pa
from pandera.typing import Series, DataFrame

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float]
    nullable_col_2: Series[float]
    nullable_col_3: Series[float]
    nullable_col_4: Series[float]
    nullable_col_5: Series[float] = pa.Field(nullable=False)
    nullable_col_6: Series[float]

    class Config:
        nullable = True

jeffzi commented 2 years ago

Hi @benlindsay. I agree repeating nullable can be verbose and cumbersome.

Pandera strives to keep feature parity between SchemaModel and DataFrameSchema, so we would need to introduce a similar option to DataFrameSchema. This is similar to the coerce option, which can be set at the schema level with both the SchemaModel and DataFrameSchema APIs. We should also have similar defaults for required, and possibly unique and allow_duplicates. The default pandera.Check(ignore_na=True) is also often asked about. Can you think of other candidates?

An alternative would be a global config similar to the pandas options. It allows overriding defaults both globally and locally (via a context manager).


# global, applies to all schemas
pandera.options.nullable = True

# local
with pandera.options("nullable", True):

    class MySchema(pa.SchemaModel):
        nullable_col_1: Series[float]
        col_2: Series[float] = pa.Field(nullable=False)

One downside of a global config is that you would need to read both the schema definition and config to have the complete picture. The config could be stored in a separate python module, which could lead to surprising results for someone who only read the schema. I personally prefer the schema to be self-contained, without relying on side effects. On the upside, this mechanism avoids the proliferation of schema arguments.
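
The global and local override mechanism described above could be sketched roughly as follows. This is a minimal stdlib-only illustration, not pandera code: the `_OPTIONS` registry and the `get_option`, `set_option`, and `option_context` helpers are hypothetical names chosen to echo the pandas options API, and pandera does not expose them.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator

# Hypothetical registry of schema defaults (an assumption for illustration).
_OPTIONS: Dict[str, Any] = {"nullable": False, "coerce": False}


def get_option(name: str) -> Any:
    """Return the current default for an option."""
    return _OPTIONS[name]


def set_option(name: str, value: Any) -> None:
    """Set a default globally, for everything that reads it afterwards."""
    if name not in _OPTIONS:
        raise KeyError(f"unknown option: {name!r}")
    _OPTIONS[name] = value


@contextmanager
def option_context(name: str, value: Any) -> Iterator[None]:
    """Temporarily override a default and restore the old value on exit."""
    previous = get_option(name)
    set_option(name, value)
    try:
        yield
    finally:
        set_option(name, previous)


with option_context("nullable", True):
    inside = get_option("nullable")  # the local override is active here
outside = get_option("nullable")     # the previous default is restored
```

The sketch also makes the downside concrete: code that reads `get_option("nullable")` gives different answers depending on where it runs, which is exactly the side effect a self-contained schema avoids.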

An imperfect solution to reduce verbosity that you can apply right away:

from typing import Any
import pandera as pa
from pandera.typing import Series

# SchemaModel
# pa.Field is a function rather than a type, so the return is annotated as Any.
def Nullable(*args: Any, **kwargs: Any) -> Any:
    kwargs["nullable"] = True
    return pa.Field(*args, **kwargs)

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = Nullable()
    col_2: Series[float]

# DataFrameSchema
def NullableCol(dtype: Any, *args: Any, **kwargs: Any) -> pa.Column:
    """Column with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Column(dtype, *args, **kwargs)

schema = pa.DataFrameSchema(
    {"nullable_col_1": NullableCol(float), "col_2": pa.Column(float)}
)

vovavili commented 2 years ago

@jeffzi This one is a lifesaver, merci beaucoup!

cosmicBboy commented 2 years ago

If you're into functools, you can also do something like:

from functools import partial
import pandera as pa

NullableField = partial(pa.Field, nullable=True)
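
One practical difference from the wrapper functions earlier in the thread: a keyword bound with `partial` is only a default and can still be overridden at the call site, whereas a wrapper that assigns `kwargs["nullable"] = True` always wins. The sketch below shows this with a hypothetical stand-in `field` function, so that it runs without pandera installed; it is not `pa.Field` itself.

```python
from functools import partial
from typing import Any, Dict


def field(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in for a field factory; it just records its keyword arguments."""
    return kwargs


def forced_nullable(**kwargs: Any) -> Dict[str, Any]:
    """Wrapper style: overwrites whatever the caller passed for nullable."""
    kwargs["nullable"] = True
    return field(**kwargs)


# partial style: nullable=True is bound as a default keyword argument.
nullable_field = partial(field, nullable=True)

forced = forced_nullable(nullable=False)     # caller's value is discarded
defaulted = nullable_field()                 # bound default applies
overridden = nullable_field(nullable=False)  # caller's value replaces the default
```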

cosmicBboy commented 2 years ago

Would welcome a PR to add this option at the dataframe-schema level!

To open up a discussion about the semantics of options at the dataframe and field (column and index) levels: the existing coerce option currently behaves so that coerce=True at the schema level overrides any field with coerce=False. This seems unintuitive to me, i.e. it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Similarly, DataFrameSchema(..., nullable=True) should define the global setting and Column(..., nullable=False) should override that.

What do y'all think @benlindsay @jeffzi @vovavili ?

vovavili commented 2 years ago

@cosmicBboy That sounds like a dream option for me! Would save me so much time. Full support.

jeffzi commented 2 years ago

it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Agreed. To be accurate, it should be "any explicitly set options at more granular levels should override the global setting." coerce=False is the default field option; if it is omitted, the global coerce=True should indeed override it. Technically, we'd need a sentinel value to differentiate unset arguments from defaults. See the unapproved PEP 661 - Sentinel Values.
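
To illustrate the sentinel idea, here is a minimal sketch of telling "argument omitted" apart from "argument explicitly set to False". The `_Unset` class, the `UNSET` object, and the `resolve_coerce` helper are hypothetical names for illustration only; they are not part of pandera or of PEP 661.

```python
from typing import Any


class _Unset:
    """Sentinel type meaning 'the caller did not pass this argument'."""

    def __repr__(self) -> str:
        return "UNSET"


# Single shared instance, compared by identity.
UNSET: Any = _Unset()


def resolve_coerce(field_coerce: Any = UNSET, schema_coerce: bool = False) -> bool:
    """Use the field-level value if it was set explicitly, else the schema-level one."""
    if field_coerce is UNSET:
        return schema_coerce
    return bool(field_coerce)


omitted = resolve_coerce(schema_coerce=True)                 # schema-level value applies
explicit_false = resolve_coerce(False, schema_coerce=True)   # field-level False wins
explicit_true = resolve_coerce(True, schema_coerce=False)    # field-level True wins
```

With a plain `False` default, the first two calls would be indistinguishable, which is why the current coerce behavior cannot honor an explicit field-level `coerce=False`.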

blais commented 1 year ago

I ran into the need for this today as well. My output schemas are all coerce on input and nullable on all outputs across an API surface; I was going to make a base class for the output models but I couldn't do the output bit.

cosmicBboy commented 1 year ago

Cool. @blais, the pandera internals rewrite is pretty much done (just have to clean up a few more things), but after that this feature should be fairly easy to support.

aidiss commented 1 year ago

I might consider contributing to this one.

Could you briefly mention which files and classes should be modified or created? I'm not sure where to start. @cosmicBboy

blais commented 1 year ago

Thank you.

cosmicBboy commented 1 year ago

Thanks @aidiss !

So I think a reasonable approach here is to support a dataframe-level default that can be overridden at the schema-component (column or index) level.

Here are the changes that need to be made:

Add a nullable option to DataFrameSchema.__init__, which should be stored as a self.nullable instance attribute. This should be None by default.
Change the default value of nullable in {ArraySchema, SeriesSchema, Column, Index}.__init__ to None so that we can get the correct behavior in the point below. We'll also need to turn self.nullable into a private variable self._nullable, and expose nullable as a @property, which returns False if self._nullable is None. This is so that we can distinguish between default behavior and user-provided values, as explained in the point below.
The df-level default should be propagated at validation time so we don't risk changing the state of the schema components. Basically, the DataFrameSchemaBackend.collect_schema_components method needs to be updated so that the df-level value is set on the Column._nullable attribute of the copied col object. However, if Column._nullable is not None, i.e. the user provided a value, then the df-level nullable value shouldn't be applied to the column. This works out nicely because the Column.nullable @property method will default to False if Column._nullable is None. (By the way, do we want to support nullability in the Index? In that case we'll need to propagate that logic to the index schema component as well.)
The nullable attribute needs to be added to the BaseConfig class for the class-based API.
Update the kwargs here to include the new option.
Add tests for DataFrameSchema here and for DataFrameModel here.
Update this docs page to include a subheading explaining the behavior of this new option.
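
The None-by-default attribute, the `nullable` property, and the validation-time propagation described in the steps above can be sketched with toy classes. This is only an illustration of the pattern under stated assumptions: `ToyColumn`, `ToySchema`, and `collect_components` are hypothetical stand-ins, not pandera's `Column`, `DataFrameSchema`, or `DataFrameSchemaBackend.collect_schema_components`.

```python
import copy
from typing import Dict, Optional


class ToyColumn:
    """Toy schema component (hypothetical, not pandera's Column)."""

    def __init__(self, nullable: Optional[bool] = None) -> None:
        # None means "the user did not set nullable on this column".
        self._nullable = nullable

    @property
    def nullable(self) -> bool:
        # Fall back to False when nothing was set at any level.
        return False if self._nullable is None else self._nullable


class ToySchema:
    """Toy dataframe-level schema carrying an optional nullable default."""

    def __init__(
        self, columns: Dict[str, ToyColumn], nullable: Optional[bool] = None
    ) -> None:
        self.columns = columns
        self.nullable = nullable

    def collect_components(self) -> Dict[str, ToyColumn]:
        """Return copies of the columns with the schema-level default applied."""
        resolved = {}
        for name, col in self.columns.items():
            col = copy.copy(col)  # leave the user's column objects untouched
            if col._nullable is None and self.nullable is not None:
                col._nullable = self.nullable
            resolved[name] = col
        return resolved


schema = ToySchema(
    {"a": ToyColumn(), "b": ToyColumn(nullable=False)},
    nullable=True,
)
resolved = schema.collect_components()
# resolved["a"] picks up the schema-level True; resolved["b"] keeps its explicit False.
```

Because the default is applied to copies, the column objects stored on the schema still report their original state after validation.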

Be sure to check out the contributing guide before you get started, and let me know if you have any questions! I'll be OOO for the next two weeks but can answer any questions you have after I get back from vacation.

crsren commented 4 months ago

Hey @cosmicBboy @aidiss, any update on this? Happy to pick this up if it's still open, been hoping for this to become a thing for a while and keep coming back to this issue. :)

My specific use case is something like:

import pandera as pa
from pandera.typing import Series

class Hamburger(pa.DataFrameModel):
    patty: Series[int]
    cheese: Series[bool] = pa.Field(nullable=True)
    tomato: Series[int]
    salat: Series[str]

class VeganHamburger(pa.DataFrameModel):
    class Config:
        nullable = True

unionai-oss / pandera

add "nullable" option to SchemaModel Config class #742