unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Using regex for column names in SchemaModel #666

Closed by markkvdb 3 years ago

markkvdb commented 3 years ago

Is it possible to write a SchemaModel class whose column names follow a regex pattern, e.g., ^[0-9]+$ to match 0, 1, 2, 3, 4, etc.?

If not, can I use the DataFrameSchema class in the same way as the SchemaModel class?

jeffzi commented 3 years ago

Hi @markkvdb,

You can set regex=True on Field. If the pattern is not a valid name for a class attribute (which is the case in your example), pass the pattern through alias as well:

import pandas as pd
import pandera as pa
from pandera.typing import Series

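class Schema(pa.SchemaModel):
    # The attribute name `a` is only a placeholder; `alias` carries the regex,
    # and regex=True applies the checks to every column whose name matches it.
    a: Series[int] = pa.Field(alias="^[0-9]+$", regex=True, ge=0)

# Column "1" matches the pattern, so -1 fails the ge=0 check.
df = pd.DataFrame({"1": [-1]})
Schema.validate(df)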
#> Traceback (most recent call last):
#> /tmp/ipykernel_259488/1760133918.py in <module>
#> ----> 1 Schema.validate(df)
...

#> SchemaError: <Schema Column(name=1, type=DataType(int64))> failed element-wise validator 0:
#> <Check greater_than_or_equal_to: greater_than_or_equal_to(0)>
#> failure cases:
#>    index  failure_case
#> 0      0            -1

If you define a DataFrameSchema instead, Column has a similar regex argument.

markkvdb commented 3 years ago

Thanks for your clear answer. It seems to work now! I already had it working with DataFrameSchema, but all my other schemas are defined with SchemaModel, so sticking to a single approach is a bit nicer.