unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars, 305 forks

Add nullable column when missing. #687

Open MikiGrit opened 2 years ago

MikiGrit commented 2 years ago

I would like to use pandera to validate columns that are missing in the dataframe but are nullable, so they can be safely added. Would it be possible? Perhaps with some new config key like coerce_columns=True?

The reason I'm asking is that I'm parsing XML files in which some fields may be missing. Since the schema of the resulting dataframe is defined in only one place (a pandera SchemaModel), I would like to read everything from the files dynamically (e.g. with {xml_value.attrib['name']: xml_value.text for xml_value in xml_values.findall('value')}) and then have pandera's validate step add any columns that are missing.
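To make the use case concrete, here is a minimal sketch of that parsing step using only the standard library. The XML layout, the `parse_records` helper, and the `EXPECTED_COLUMNS` list are assumptions made up for illustration; they are not taken from the original files or from pandera.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second record lacks the "comment" field,
# and no record carries a "reviewer" field at all.
XML_TEXT = """
<records>
  <record>
    <value name="id">1</value>
    <value name="comment">first</value>
  </record>
  <record>
    <value name="id">2</value>
  </record>
</records>
"""

# Column names a schema might declare (assumed for this example).
EXPECTED_COLUMNS = ["id", "comment", "reviewer"]


def parse_records(xml_text):
    """Return one dict per <record>, holding only the fields that are present."""
    root = ET.fromstring(xml_text.strip())
    return [
        {value.attrib["name"]: value.text for value in record.findall("value")}
        for record in root.findall("record")
    ]


records = parse_records(XML_TEXT)

# Fields that never appear in any record would be missing columns
# in a dataframe built from these dicts.
present = {key for row in records for key in row}
missing = [name for name in EXPECTED_COLUMNS if name not in present]

print(records)
print(missing)
```

A field that appears in only some records (like "comment") still becomes a column when the dicts are loaded into a dataframe; only fields absent from every record (like "reviewer") end up as missing columns.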

benlindsay commented 2 years ago

Similar in spirit to my other issue (https://github.com/pandera-dev/pandera/issues/706), but distinct. Would love to see both of these implemented when someone gets time.

cosmicBboy commented 2 years ago

Hi @MikiGrit thanks for articulating the feature request! gonna ping @aodj here too, who created a very similar issue (#893).

In short: yes! I give my blessing to support this feature 😀. The use case is clear and will provide value to a lot of other folks using pandera. A related issue is #502, which allows users to fill in default values in a column... this would take it to another level, filling in missing columns (potentially with a default value?)

Just a quick pre-amble: I've been doing a major overhaul of pandera to abstract out all the pandas-specific logic into its own set of modules/classes as part of #381, and I think this change is a good candidate for figuring out if the next-gen schema abstraction is easy to extend.

The working branch is here: https://github.com/unionai-oss/pandera/tree/core-schema

Solution Proposal

Add a new option to DataFrameSchema (and SchemaModel.Config) called add_missing_columns, which adds missing columns if True, and is False by default.

In this first iteration of the feature, this option should only work with nullable columns: if a missing column is not nullable, validation will raise a SchemaError. This restriction should be lifted once users can specify a default value (#502).

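import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "col1": pa.Column(int),
        "col2": pa.Column(int, nullable=True),
    },
    add_missing_columns=True,  # the option proposed above; False by default
)

# "col2" is absent from the input data.
data = pd.DataFrame({"col1": [1]})

validated_data = schema(data)
# validated_data should now have a "col2" column, which contains all null values.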
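For illustration only, the intended behaviour can be approximated in plain pandas. This is a sketch, not pandera code: `add_missing_nullable_columns` and the `nullable_by_column` mapping are hypothetical names, and ValueError stands in for the SchemaError mentioned in the proposal so that the example does not depend on pandera.

```python
import pandas as pd


def add_missing_nullable_columns(df, nullable_by_column):
    """Return a copy of df with absent nullable columns added as all-null.

    nullable_by_column maps column name -> bool. It is a hypothetical
    stand-in for the nullable flag on each schema column.
    """
    missing = [name for name in nullable_by_column if name not in df.columns]
    not_nullable = [name for name in missing if not nullable_by_column[name]]
    if not_nullable:
        # The proposal raises SchemaError here; ValueError keeps this
        # sketch independent of pandera.
        raise ValueError(f"missing non-nullable columns: {not_nullable}")

    out = df.copy()  # leave the caller's dataframe untouched
    for name in missing:
        out[name] = pd.NA
    return out


data = pd.DataFrame({"col1": [1]})
result = add_missing_nullable_columns(data, {"col1": False, "col2": True})

print(result.columns.tolist())
print(result["col2"].isna().all())
```

Note that an all-null column cannot keep a plain int dtype in pandas, so a real implementation would also have to decide how the added column's dtype interacts with the declared column type.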

Steps to Implement

The new pandera module structure consists of core modules and backend modules. This functionality would live in the backend modules, which implement the actual validation logic. The strict_filter_columns implementation, which is invoked in the backends.pandas.container.DataFrameSchemaBackend.validate method, is a good reference for how this kind of step is wired in.

  1. Add the add_missing_columns option to the core.pandas.container.DataFrameSchema class.
  2. This option will then be available in DataFrameSchemaBackend.validate, which receives the schema object.
  3. A new DataFrameSchemaBackend method implements the add_missing_columns functionality and should be invoked in DataFrameSchemaBackend.validate.
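The three steps above could be wired together roughly as follows. This is a self-contained sketch under stated assumptions: `SketchColumn`, `SketchSchema`, and `SketchBackend` are hypothetical stand-ins, not pandera's actual classes, and the real backend's method signatures and error types may differ.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class SketchColumn:
    """Hypothetical stand-in for a schema column."""
    nullable: bool = False


@dataclass
class SketchSchema:
    """Hypothetical stand-in for the core DataFrameSchema (step 1)."""
    columns: dict = field(default_factory=dict)
    add_missing_columns: bool = False  # the new option, off by default


class SketchBackend:
    """Hypothetical stand-in for DataFrameSchemaBackend."""

    def validate(self, check_obj, schema):
        # Step 2: the backend receives the schema, so it can read the option.
        if schema.add_missing_columns:
            # Step 3: delegate to a dedicated method before other checks run.
            check_obj = self.add_missing_columns(check_obj, schema)

        # Simplified column-presence check standing in for real validation.
        absent = [c for c in schema.columns if c not in check_obj.columns]
        if absent:
            raise ValueError(f"columns not in dataframe: {absent}")
        return check_obj

    def add_missing_columns(self, check_obj, schema):
        absent = [c for c in schema.columns if c not in check_obj.columns]
        for name in absent:
            if not schema.columns[name].nullable:
                raise ValueError(f"cannot add non-nullable column {name!r}")
        # reindex keeps the existing data and fills the new columns with NaN.
        return check_obj.reindex(columns=[*check_obj.columns, *absent])


schema = SketchSchema(
    {"col1": SketchColumn(), "col2": SketchColumn(nullable=True)},
    add_missing_columns=True,
)
out = SketchBackend().validate(pd.DataFrame({"col1": [1]}), schema)
print(out.columns.tolist())
```

Running the column-adding step first means the later checks see a complete dataframe, which is the same preprocessing role strict_filter_columns plays; whether that ordering is right for the real backend is an open design question.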

@MikiGrit @aodj let me know if either of you have the capacity to implement this feature! I can help guide/answer any questions!