unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

add "nullable" option to SchemaModel Config class #742

Open benlindsay opened 2 years ago

benlindsay commented 2 years ago

I have a situation where almost all of the columns in my schemas are nullable, and it would be nice to set nullable = True as a config option instead of setting nullable=True for every column. For example, instead of this:

import pandera as pa
from pandera.typing import Series, DataFrame

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = pa.Field(nullable=True)
    nullable_col_2: Series[float] = pa.Field(nullable=True)
    nullable_col_3: Series[float] = pa.Field(nullable=True)
    nullable_col_4: Series[float] = pa.Field(nullable=True)
    nullable_col_5: Series[float]
    nullable_col_6: Series[float] = pa.Field(nullable=True)

I'd love to be able to do this or something like it:

import pandera as pa
from pandera.typing import Series, DataFrame

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float]
    nullable_col_2: Series[float]
    nullable_col_3: Series[float]
    nullable_col_4: Series[float]
    nullable_col_5: Series[float] = pa.Field(nullable=False)
    nullable_col_6: Series[float]

    class Config:
        nullable = True
jeffzi commented 2 years ago

Hi @benlindsay. I agree repeating nullable can be verbose and cumbersome.

Pandera strives to keep feature parity between SchemaModel and DataFrameSchema, so we would need to introduce a similar option to DataFrameSchema. This is similar to the coerce option, which can be set at the schema level with both the SchemaModel and DataFrameSchema APIs. We should also have similar defaults for required, and possibly unique and allow_duplicates. The default pandera.Check(ignore_na=True) is also often asked about. Can you think of other candidates?

An alternative would be a global config similar to the pandas options. It would allow defaults to be overridden both globally and locally (via a context manager).


# global, applies to all schemas

pandera.options.nullable = True

# local

with pandera.options("nullable", True):

    class MySchema(pa.SchemaModel):
        nullable_col_1: Series[float]
        col_2: Series[float] = pa.Field(nullable=False)

One downside of a global config is that you would need to read both the schema definition and the config to get the complete picture. The config could be stored in a separate Python module, which could lead to surprising results for someone who only reads the schema. I personally prefer the schema to be self-contained, without relying on side effects. On the upside, this mechanism avoids the proliferation of schema arguments.
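To make the trade-off concrete, here is a minimal sketch of how a global default with a local, context-managed override could behave. Everything in it (`_DEFAULTS`, `get_option`, `set_option`, `option_context`, `make_field`) is a hypothetical stand-in written for illustration; none of it is pandera API, and the real design might look quite different.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator

# Hypothetical registry of defaults; pandera does not ship this.
_DEFAULTS: Dict[str, Any] = {"nullable": False, "coerce": False}


def get_option(name: str) -> Any:
    """Return the current default for an option."""
    return _DEFAULTS[name]


def set_option(name: str, value: Any) -> None:
    """Set a default globally, for every field defined afterwards."""
    if name not in _DEFAULTS:
        raise KeyError(f"unknown option: {name!r}")
    _DEFAULTS[name] = value


@contextmanager
def option_context(name: str, value: Any) -> Iterator[None]:
    """Temporarily override one default and restore it on exit."""
    previous = get_option(name)
    set_option(name, value)
    try:
        yield
    finally:
        _DEFAULTS[name] = previous


def make_field(nullable: Any = None) -> Dict[str, Any]:
    """Stand-in field constructor that reads the default at definition time."""
    return {"nullable": get_option("nullable") if nullable is None else nullable}


with option_context("nullable", True):
    inside = make_field()                  # picks up the local default
    explicit = make_field(nullable=False)  # an explicit value still wins
outside = make_field()                     # default is restored after the block
```

Note that `inside` and `outside` come from identical calls yet differ, which is exactly the "need to read both the schema and the config" downside described above. A module-level dict like this is also not thread-safe.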

An imperfect solution to reduce verbosity that you can apply right away:

from typing import Any
import pandera as pa
from pandera.typing import Series

# SchemaModel

def Nullable(*args: Any, **kwargs: Any) -> pa.Field:
    kwargs["nullable"] = True
    return pa.Field(*args, **kwargs)

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = Nullable()
    col_2: Series[float]

# DataFrameSchema

def NullableCol(dtype: Any, *args: Any, **kwargs: Any) -> pa.Column:
    """Column with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Column(dtype, *args, **kwargs)

schema = pa.DataFrameSchema(
    {"nullable_col_1": NullableCol(float), "col_2": pa.Column(float)}
)
vovavili commented 2 years ago

@jeffzi This one is a lifesaver, merci beaucoup!

cosmicBboy commented 2 years ago

if you're into functools, you can also do something like:

from functools import partial
import pandera as pa

NullableField = partial(pa.Field, nullable=True)
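One practical difference from the wrapper-function workaround earlier in the thread: that wrapper assigns kwargs["nullable"] = True unconditionally, whereas a keyword pre-filled by functools.partial can still be replaced at the call site. The sketch below shows this with a stand-in `field` function (hypothetical, used here so the example runs without pandera); `ge` is just an arbitrary extra keyword.

```python
from functools import partial
from typing import Any, Dict


def field(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in for a field constructor; it just records its keyword arguments."""
    return kwargs


nullable_field = partial(field, nullable=True)

default_case = nullable_field()              # uses the pre-filled keyword
with_extra = nullable_field(ge=0)            # other keywords pass through
overridden = nullable_field(nullable=False)  # call-site keyword replaces the pre-filled one
```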
cosmicBboy commented 2 years ago

Would welcome a PR to add this option at the dataframe-schema level!

To open up a discussion about the semantics of options at the dataframe and field (column and index) levels: the existing coerce option currently behaves such that coerce=True at the schema level overrides any field with coerce=False. This seems unintuitive to me... i.e. it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Similarly, DataFrameSchema(..., nullable=True) should define the global setting and Column(..., nullable=False) should override that.

What do y'all think @benlindsay @jeffzi @vovavili ?

vovavili commented 2 years ago

@cosmicBboy That sounds like a dream option for me! Would save me so much time. Full support.

jeffzi commented 2 years ago

it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Agreed. To be accurate, it should be "any explicitly set options at more granular levels should override the global setting." coerce=False is the default field option, so when it is omitted, the global coerce=True should indeed override it. Technically, we'd need a sentinel value to differentiate unset arguments from defaults. See the unapproved PEP 661 - Sentinel values.
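A rough sketch of the sentinel idea, assuming a hand-rolled marker object rather than anything from PEP 661: `Column`, `Schema`, `UNSET`, and `resolve` below are hypothetical stand-ins for illustration, not pandera classes or its actual resolution logic.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


class _Unset:
    """Type of the sentinel meaning 'the caller did not pass this argument'."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()
MaybeBool = Union[bool, _Unset]


@dataclass
class Column:
    # Hypothetical stand-in, not pandera.Column.
    nullable: MaybeBool = UNSET
    coerce: MaybeBool = UNSET


@dataclass
class Schema:
    # Hypothetical stand-in, not pandera.DataFrameSchema.
    columns: Dict[str, Column] = field(default_factory=dict)
    nullable: bool = False
    coerce: bool = False

    def resolve(self, name: str, option: str) -> bool:
        """Column value if explicitly set, otherwise the schema-level default."""
        value = getattr(self.columns[name], option)
        return getattr(self, option) if value is UNSET else value


schema = Schema(
    columns={"a": Column(), "b": Column(nullable=False), "c": Column(coerce=False)},
    nullable=True,
    coerce=True,
)
```

Here column "a" inherits both schema-level settings, while the explicit nullable=False on "b" and coerce=False on "c" win over the schema defaults. With a plain False default instead of UNSET, "an explicit False" and "nothing passed" would be indistinguishable.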

blais commented 1 year ago

I ran into the need for this today as well. My schemas all coerce on input and are nullable on all outputs across an API surface; I was going to make a base class for the output models, but I couldn't do the output bit.

cosmicBboy commented 1 year ago

cool. @blais the pandera internals re-write is pretty much done (just have to clean up a few more things) but after that this feature should be fairly easy to support.

aidiss commented 1 year ago

I might consider contributing to this one.

Could you briefly mention which files and classes should be modified or created? I'm not sure where to start. @cosmicBboy

blais commented 1 year ago

Thank you,

cosmicBboy commented 1 year ago

Thanks @aidiss !

So I think a reasonable approach here is to support a dataframe-level default that can be overridden at the schema-component level (column or index).

Here are the changes that need to be made:

Be sure to check out the contributing guide before you get started, and let me know if you have any questions! I'll be OOO for the next two weeks but can answer any questions you have after I get back from vacation.

crsren commented 4 months ago

Hey @cosmicBboy @aidiss, any update on this? Happy to pick this up if it's still open, been hoping for this to become a thing for a while and keep coming back to this issue. :)

My specific use case is something like:

import pandera as pa
from pandera.typing import Series

class Hamburger(pa.DataFrameModel):
    patty: Series[int]
    cheese: Series[bool] = pa.Field(nullable=True)
    tomato: Series[int]
    salat: Series[str]

class VeganHamburger(pa.DataFrameModel):
    class Config:
        nullable = True