Open MikiGrit opened 2 years ago
Similar in spirit to my other issue (https://github.com/pandera-dev/pandera/issues/706), but distinct. Would love to see both of these implemented when someone gets time.
Hi @MikiGrit thanks for articulating the feature request! gonna ping @aodj here too, who created a very similar issue (#893).
In short: yes! I give my blessing to support this feature 😀. The use case is clear and will provide value to a lot of other folks using pandera. A related issue is #502, which allows users to fill in default values in a column... this would take it to another level, filling in missing columns (potentially with a default value?)
Just a quick preamble: I've been doing a major overhaul of pandera to abstract out all the pandas-specific logic into its own set of modules/classes as part of #381, and I think this change is a good candidate for figuring out if the next-gen schema abstraction is easy to extend.
The working branch is here: https://github.com/unionai-oss/pandera/tree/core-schema
Add a new option to `DataFrameSchema` (and `SchemaModel.Config`) called `add_missing_columns`, which adds missing columns if `True`, and is `False` by default.
In this first iteration of the feature, this option should only work with nullable columns, and will raise a `SchemaError` if a missing column is not nullable. This restriction should be lifted once users can specify a default value (#502).
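```python
import pandas as pd
import pandera as pa

# Proposed usage: opt in with add_missing_columns=True (it is False by default).
schema = pa.DataFrameSchema(
    {
        "col1": pa.Column(int),
        "col2": pa.Column(int, nullable=True),
    },
    add_missing_columns=True,
)
data = pd.DataFrame({"col1": [1]})
validated_data = schema(data)
# validated_data should now have a "col2" column, which contains all null values.
```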
The new pandera module structure consists of `core` modules and `backend` modules. This functionality would live in the `backend` modules, which implement the actual validation logic. For reference, see the `strict_filter_columns` implementation, which is invoked in the `backends.pandas.container.DataFrameSchemaBackend.validate` method.
- Add the `add_missing_columns` option to the `core.pandas.container.DataFrameSchema` class.
- The option is then available in `DataFrameSchemaBackend.validate`, which receives the schema object.
- A new `DataFrameSchemaBackend` method implements the `add_missing_columns` functionality and should be invoked in `DataFrameSchemaBackend.validate`.

@MikiGrit @aodj let me know if either of you have the capacity to implement this feature! I can help guide/answer any questions!
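For anyone picking this up, here is a rough sketch of how the implementation steps above might fit together. It is not pandera code: `SketchColumn`, `SketchSchema`, and `SketchBackend` are hypothetical stand-ins for `pa.Column`, `DataFrameSchema`, and `DataFrameSchemaBackend`, the `validate` signature is simplified, and `ValueError` stands in for `SchemaError` so the snippet only needs pandas.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class SketchColumn:
    """Hypothetical stand-in for pa.Column; only tracks nullability."""

    nullable: bool = False


@dataclass
class SketchSchema:
    """Hypothetical stand-in for core.pandas.container.DataFrameSchema."""

    columns: dict
    # The new option lives on the schema and is off by default.
    add_missing_columns: bool = False


class SketchBackend:
    """Hypothetical stand-in for DataFrameSchemaBackend."""

    def validate(self, check_obj, schema):
        # validate receives the schema object, so it can read the option.
        if schema.add_missing_columns:
            check_obj = self.add_missing_columns(check_obj, schema)
        # The real backend would go on to run its usual checks here.
        return check_obj

    def add_missing_columns(self, check_obj, schema):
        # Dedicated method that fills in columns absent from the dataframe.
        out = check_obj.copy()  # leave the caller's dataframe untouched
        for name, column in schema.columns.items():
            if name in out.columns:
                continue
            if not column.nullable:
                # First iteration: only nullable columns may be added.
                raise ValueError(f"column {name!r} is missing and not nullable")
            out[name] = pd.NA  # new column filled entirely with nulls
        return out
```

Keeping the column-adding logic in its own method mirrors how `strict_filter_columns` is described as a separate step invoked from `validate`, and it gives a single place to plug in default values later.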
I would like to use pandera to validate columns that are missing in the dataframe but are nullable, so they can be safely added. Would it be possible? Perhaps with some new config key like `coerce_columns=True`?

The reason I'm trying to do this is that I'm parsing XML files and some fields may be missing. As the schema of the resulting dataframe is defined only in one place (the pandera SchemaModel), I would like to be able to dynamically read everything from the files (e.g. with `{xml_value.attrib['name']: xml_value.text for xml_value in xml_values.findall('value')}`) and later have the pandera validation step add the columns that are missing.
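To illustrate the situation, here is a small standard-library sketch of that parsing step. The XML layout (`<records>`, `<record>`, `<value name="...">`), the sample data, and the `EXPECTED_FIELDS` list are all made up for this example; only the dict comprehension comes from my actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file; the real files have a different layout.
SAMPLE = """
<records>
  <record><value name="id">1</value><value name="comment">ok</value></record>
  <record><value name="id">2</value></record>
</records>
"""

# Hypothetical list of fields the schema expects.
EXPECTED_FIELDS = ["id", "comment", "reviewer"]


def record_to_dict(xml_values):
    """Read every <value> child of one record into a name -> text mapping."""
    return {
        xml_value.attrib["name"]: xml_value.text
        for xml_value in xml_values.findall("value")
    }


root = ET.fromstring(SAMPLE.strip())
rows = [record_to_dict(record) for record in root.findall("record")]

# Fields that appear in at least one record.
seen_fields = set().union(*rows)

# Fields that never appear: a dataframe built from rows has no such column.
missing_fields = [name for name in EXPECTED_FIELDS if name not in seen_fields]
```

Here `reviewer` never appears in any record, so a dataframe built from `rows` would lack that column entirely, and that is the column I would like validation to add.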