You can use the strict argument to ensure that all and only the columns defined in the schema are present in the dataframe:
import pandas as pd
import pandera as pa
df = pd.DataFrame({"1": [0], "foo": ["a"]})
schema = pa.DataFrameSchema(
columns={r"\d+": pa.Column(regex=True)},
strict=True,
)
schema.validate(df, lazy=True)
#> Traceback (most recent call last):
#> ...
#> SchemaErrors: A total of 1 schema errors were found.
#> Error Counts
#> - column_not_in_schema: 1
#> Schema Error Summary
#> failure_cases n_failure_cases
#> schema_context column check
#> DataFrameSchema <NA> column_in_schema [foo] 1
#> Usage Tip
#> Directly inspect all errors by catching the exception:
#> ```
#> try:
#> schema.validate(dataframe, lazy=True)
#> except SchemaErrors as err:
#> err.failure_cases # dataframe of schema errors
#> err.data # invalid dataframe
#> ```
Unfortunately, the SchemaModel api does not support the omission of data types. I've just opened an issue about it and will work on its resolution (#522).
If all your columns have the same type (e.g. int), this will work with the current version:
import pandera as pa
class Schema(pa.SchemaModel):
columns: pa.typing.Series[int] = pa.Field(alias=r"\d+", regex=True)
class Config:
strict = True
@jeffzi I think the question is about a case where the column names are of type int instead of strings that represent integers:
import pandas as pd
import pandera as pa
df = pd.DataFrame({1: [0]})
schema = pa.DataFrameSchema(
columns={r"\d+": pa.Column(regex=True)},
strict=True,
)
schema.validate(df, lazy=True)
# Traceback (most recent call last):
# ...
# schema.validate(df, lazy=True)
# File "...\pandera\schemas.py", line 465, in validate
# col_schema.get_regex_columns(check_obj.columns)
# File "...\pandera\schema_components.py", line 248, in get_regex_columns
# pd.Index(columns.str.match(self.name))
# File "...\pandas\core\accessor.py", line 187, in __get__
# accessor_obj = self._accessor(obj)
# File "...\pandas\core\strings.py", line 2041, in __init__
# self._inferred_dtype = self._validate(data)
# File "...\pandas\core\strings.py", line 2098, in _validate
# raise AttributeError("Can only use .str accessor with string values!")
# AttributeError: Can only use .str accessor with string values!
@SebbanSms In that case, the problem is that regex=True implicitly requires the DataFrame's column index to be of type str.
I think it would be possible to cast to type str here. @cosmicBboy Do you anticipate side effects?
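In the meantime, a user-side workaround is to cast the column index to str before validating. This is a minimal sketch, assuming the downstream code can tolerate string column names (the extra "foo" column is only there to show a mix of regex-matched and literal columns):

import pandas as pd
import pandera as pa

df = pd.DataFrame({1: [0], "foo": ["a"]})

schema = pa.DataFrameSchema(
    columns={
        r"\d+": pa.Column(regex=True),
        "foo": pa.Column(str),
    },
    strict=True,
)

# casting the column names to str lets the regex matching work,
# but it also changes the dataframe's column names to strings
df.columns = df.columns.astype(str)
schema.validate(df, lazy=True)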
@jeffzi how would casting to string work only for the validation using the @check_types decorator?
Also if there are already string integers:
df = pd.DataFrame({1: [0], '2': [0]})
the DataFrame should still raise an exception, in this case only for the column name '2' but not for the 1.
Context: the code that subsequently works with the checked DataFrame expects the numbered column names to be of type int.
So far we came up with this workaround to also perform some checks on the integer columns:
import pandas as pd
import pandera as pa
from pandera import check_types
from pandera.typing import DataFrame, Series
class Test(pa.SchemaModel):
other_column: Series[str] = pa.Field(nullable=True)
@pa.dataframe_check
def some_columns_are_integers(cls, df: pd.DataFrame) -> bool:
# to check if values are correct
schema = pa.SeriesSchema(pa.Int, checks=[pa.Check.greater_than_or_equal_to(0)])
# to check if column names are correct
other_column_names = list(cls.__fields__.keys())
for col in df.columns:
if col not in other_column_names:
assert isinstance(col, int), \
f'Found unexpected column name: {col}\n' \
f'Apart from expected other columns: {other_column_names}\n' \
f'only columns with integers are allowed in this DataFrame.'
schema.validate(df[col])
return True
class Config:
strict = False
Some tests:
@check_types
def test_df_no_error() -> DataFrame[Test]:
return pd.DataFrame(
{
'other_column': ['unimportant', 'unimportant', 'unimportant'],
1: [1, 2, 3],
2: [1, 2, 3],
3: [1, 2, 3],
4: [1, 2, 3],
5: [1, 2, 3]
}
)
@check_types
def test_df_str_cols() -> DataFrame[Test]:
return pd.DataFrame(
{
'other_column': ['unimportant', 'unimportant', 'unimportant'],
1: [1, 2, 3],
2: [1, 2, 3],
3: [1, 2, 3],
4: [1, 2, 3],
'5': [1, 2, 3]
}
)
@check_types
def test_df_negative_values() -> DataFrame[Test]:
return pd.DataFrame(
{
'other_column': ['unimportant', 'unimportant', 'unimportant'],
1: [-1, -2, -3],
2: [1, 2, 3],
3: [1, 2, 3],
4: [1, 2, 3],
5: [1, 2, 3]
}
)
test_df_no_error()
print('no error until here')
try:
test_df_str_cols()
except Exception as e:
print(e)
try:
test_df_negative_values()
except Exception as e:
print(e)
results:
no error until here
Found unexpected column name: 5
Apart from expected other columns: ['other_column']
only columns with integers are allowed in this DataFrame.
error in check_types decorator of function 'test_df_negative_values': <Schema SeriesSchema(name=None, type=int)> failed element-wise validator 0:
<Check greater_than_or_equal_to: greater_than_or_equal_to(0)>
failure cases:
index failure_case
0 0 -1
1 1 -2
2 2 -3
hey @Christoph-1 @SebbanSms this is an interesting use case that pandera currently doesn't support very elegantly; the workaround you posted is the direction I was about to suggest!
Non-string column names do seem like an important thing to validate as part of a schema. Currently pandera assumes (i) for regex=True that all column names are strings and (ii) that non-string column names are known ahead of time.
What you are asking for is meta-validation of dataframe metadata (the types and allowable values of column name keys)... indeed the alias and regex options are sort of going in this direction.
I'd like to figure out an ergonomic way of doing this... here's a proposal off the top of my head:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
columns={
# column name key is a type, not a literal
int: pa.Column(
int,
checks=[
pa.Check.gt(0), # check that values in the column are positive
pa.Check(lambda s: s.name > 0) # check that the column key itself is positive
]
),
"other_column": pa.Column(str),
},
strict=True
)
df = pd.DataFrame({
1: [0],
'2': [0],
"other_column": ["foo"],
})
# SchemaModel version
class Schema(pa.SchemaModel):
# alias argument is a type, not a literal
integer_columns: pa.typing.Series[int] = Field(alias=int, gte=0)
other_column: pa.typing.Series[str]
@pa.check(int)
def check_integer_column_keys(cls, series):
return series.name > 0
class Config:
strict = True
This solution would match all integer column keys and apply the Column/Field checks to those columns. The "2" column name will be caught by the strict=True flag.
@cosmicBboy looks like a good solution. You would need to change the implementation of the alias arg to accept types as input for this to work, right?
I'd also propose preparing for other pivot tables (in matrix format) by allowing a list of column names.
E.g. if we have a DataFrame that has some price columns for products over time:

| date | mp3_player | smartphone | computer | ... |
|---|---|---|---|---|
| 2021-01-01 | 25 | 190 | 450 | ... |
| 2021-01-02 | 30 | 185 | 500 | ... |
| ... | ... | ... | ... | ... |
it would also be helpful to have a list of column names that have the same datatype and checks that can be used, instead of blowing up the SchemaModel with repeated information:
product_list = ["mp3_player", "smartphone", "computer", ...]
...
product_price_matrix: pa.typing.Series[int] = Field(alias=product_list, ge=0, ...)
Or, in the integer use case, it would be cool to be able to further restrict the range of integers allowed as column names. For example:
integer_list = range(1, 51)
specific_integer_columns: pa.typing.Series[int] = Field(alias=integer_list, gte=0)
btw, if you know the column keys beforehand you can do this with DataFrameSchema:
pa.DataFrameSchema({key: pa.Column(...) for key in range(1, 51)})
The code is more concise but the problem remains that the schema object is sort of over-specified with duplicated column objects.
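For reference, a fuller, runnable version of that dict-comprehension approach might look like this; the int dtype, the ge(0) check, and required=False are illustrative assumptions filling in the elided Column arguments:

import pandas as pd
import pandera as pa

# one Column object per allowed integer key; required=False so that not all
# fifty columns have to be present in every dataframe
schema = pa.DataFrameSchema(
    {key: pa.Column(int, checks=pa.Check.ge(0), required=False) for key in range(1, 51)},
    strict=True,
)

schema.validate(pd.DataFrame({1: [0], 2: [3]}))  # integer column names validate fine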
Thinking about this problem more, I feel like piggy-backing on alias is a little bit of a hack... I think we can boil the problem down to pattern matching (sort of like in the structural pattern matching sense): we want to be able to match a Column/Index/Field specification to some key that we either know ahead of time or match by pattern or type.
The regex option sort of does this, but only with strings. So here's an updated proposal:
Add a matches option to match column names to an iterable or type:
from typing import Union

import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
columns={
"integer_columns": pa.Column(..., matches=int), # match a single type
"number_columns": pa.Column(..., matches=Union[int, float]), # match multiple types
"product_column": pa.Column(..., matches="foo"), # match a single literal, can be used in combination with regex=True
"product_columns": pa.Column(..., matches=["foo", "bar", "baz"]), # match multiple literals
},
strict=True
)
# SchemaModel version
class Schema(pa.SchemaModel):
# the matches argument takes a type or literal instead of an alias
integer_columns: pa.typing.Series[int] = pa.Field(matches=int)
number_columns: pa.typing.Series[int] = pa.Field(matches=Union[int, float])
product_column: pa.typing.Series[str] = pa.Field(matches="foo")
product_columns: pa.typing.Series[str] = pa.Field(matches=["foo", "bar", "baz"])
In SchemaModel, specifying both the matches and alias options should raise a SchemaInitError.
In summary, these are the various execution paths:
- matches not specified: use column name keys
- a str literal is provided, e.g. matches="foo": use it to match the column name (same behavior as alias)
- a str literal is provided and regex=True, e.g. matches="\d+": expand column names using the regex
- a List[literal] is provided: match column names to this set of keys
- a type is provided: match column names of that type
- a List[type] is provided: match column names of those types

@jeffzi thoughts?
Correct me if I'm wrong, but it sounds like there are 2 distinct problems:
it would also be helpful to have a list of column names that have the same datatype...
1. Validating column names that are not strings (a mixed-type column index).
2. Selecting columns based on their dtype (content), in order to apply specific checks to integers for example.
Regarding 1: What's the use case for a mixed-type column index? I'm guessing that situation arises when pivoting the dataframe. Pandas does not handle that too well:
from datetime import date
import pandas as pd
df = pd.DataFrame({1: [0], "2": [0], date(2021, 1, 1): [0]})
print(df.columns.dtype) # mixed type falls back to object
#> object
print(df.columns.str.lower()) # str accessor has unexpected result
#> Index([nan, '2', nan], dtype='object')
print(df.columns.astype(str).str.lower()) # works as expected
#> Index(['1', '2', '2021-01-01'], dtype='object')
@cosmicBboy The matches argument would apply to column name types, right? I'm wondering if it's worth adding complexity to the API. I was suggesting that pandera internally casts column names to string (see the last line of code above) and lets users rely on regex to select the column names. They'd need to give a regex like \d+ to select numbers, which should work as far as I know.
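A rough sketch of that idea (a hypothetical helper, not pandera's actual implementation): cast the column index to str only for the regex match, and return the original, un-cast names.

import pandas as pd

def match_columns(pattern: str, columns: pd.Index) -> pd.Index:
    # cast names to str for matching, but keep the original column name objects
    return columns[columns.astype(str).str.match(pattern)]

print(match_columns(r"\d+", pd.Index([1, "2", "other_column"])))
#> Index([1, '2'], dtype='object')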
For the case mentioned by @SebbanSms:
df = pd.DataFrame({1: [0], '2': [0]}) the DataFrame should still raise an exception. In this case only for the column name '2' but not of the 1. Context: The code following working with the checked DataFrame expects the numbered column names to be of type int.
we want to select column names that are a number (regex solution works for that part) and also select columns that are of type integer.
Actually, some time ago I built a library to select pandas columns (inspired by R's dplyr). It's probably overkill for pandera but we could have Selector objects that can be combined with boolean operations:
import pandera as pa
from pandas_select import HasDtype, Match
# select columns whose names are a number and contain integers
pa.DataFrameSchema(
columns={
(HasDtype(int) & Match("\d+")): pa.Column(checks=...), # could create HasColumnNameType to replace Match,
# selectors are optional
"foo": pa.Column(...),
},
)
Selectors could be used for other schema operations: update_columns, etc. That way we can have a zoo of selectors instead of editing the schema/column api as new cases turn up.
it would also be helpful to have a list of column names that have the same datatype...
Sorry, my wording was not very precise there. I actually meant if columns share the same Field, so also the same checks and so forth.
Coming back to the price per product and time example:
I may want to check that all prices are greater than zero, have no extreme fluctuation on a daily basis, and so forth.
So I rather meant columns that carry the same type of data (not all columns having the same dtype).
All price per product columns in the given example would probably have some basic checks that should be repeated for them. I think we still have to define which columns are price per product columns separately, e.g. by a list of product names.
So selecting columns based on their dtype (content) would not be needed for this example.
You can already select a list of column names with a regex:
import pandas as pd
import pandera as pa
product_list = ["mp3_player", "smartphone", "computer"]
products = r"|".join(product_list)
schema = pa.DataFrameSchema({products: pa.Column(checks=[pa.Check.ge(0)], regex=True)})
schema.validate(pd.DataFrame({"smartphone": [-1]}))
#> Traceback (most recent call last):
#> ...
#> SchemaError: <Schema Column(name=smartphone, type=None)> failed element-wise validator 0:
#> <Check greater_than_or_equal_to: greater_than_or_equal_to(0)>
#> failure cases:
#> index failure_case
#> 0 0 -1
I forgot to mention that a drawback of my Selector proposal is that we'd need an argument pa.Field(selectors=...), similar to @cosmicBboy's matches. In my opinion, it's not worth introducing matches if it just enables the seemingly uncommon case of mixed-type column names, unless that case cannot be handled by the proposed internal str casting + regex.
okay, now I'm confused... @SebbanSms @Christoph-1 can you please confirm these use cases?
From my understanding, you want to:
1. validate dataframes with an arbitrary number of columns whose names are integers (not strings), and
2. check the column name keys themselves, e.g. that they are positive integers.

So I rather meant columns that carry the same type of data (not all columns having the same dtype)

Can you clarify this @SebbanSms? I can't tell what the contrast is.
I was suggesting that pandera internally casts column names to string (see last line of code above) and let users rely on regex to select the column names.
I think this would be the path of least resistance to support (1). Would need to update the code here and here to convert the column names to strings (making sure the original column name types are preserved).
For (2), a custom Check would cover that:
schema = pa.DataFrameSchema(
columns={
"\d+": pa.Column(
int,
# check that the column key is a positive int
checks=pa.Check(lambda s: isinstance(s.name, int) and s.name > 0),
regex=True,
),
"other_column": pa.Column(str),
},
strict=True
)
As @jeffzi pointed out, this solution covers any case where a finite number of column names (of any type that can be cast into str) are known beforehand:
import pandas as pd
import pandera as pa
product_list = ["mp3_player", "smartphone", "computer"]
products = r"|".join(product_list)
integers = r"|".join(range(1, 51))
schema = pa.DataFrameSchema({
products: pa.Column(checks=[pa.Check.ge(0)], regex=True)
integers: pa.Column(..., regex=True)
})
@cosmicBboy In our specific case, making sure that all columns that have a positive integer as column name share the same checks would be enough.
same type of data
Sorry once again for the imprecise language. With this I meant columns that should have the same checks (e.g. checks for all price per product columns).
Thanks for the example on how to do this with a list of string columns in combination with regex, I'll note it down for the future.
Question about pandera
Through aliases and regex it is possible to check whether an arbitrary number of column names are valid, as long as the column names are strings. In the example below, all integers above 0 are valid column names.
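A minimal sketch of such a schema, assuming string column names and integer-valued columns (the field name int_columns and the int dtype are illustrative):

import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    # any column whose (string) name is an integer above 0 matches this field
    int_columns: Series[int] = pa.Field(alias=r"^[1-9]\d*$", regex=True)

    class Config:
        strict = True

Schema.validate(pd.DataFrame({"1": [0], "42": [1]}))  # string column names: ok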
Is there a similar check if the column names are integers? I have a data frame with an arbitrary number of columns whose names are integers and I want to implement a check that validates the column names.