unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Integer as column names #521

Closed Christoph-1 closed 3 years ago

Christoph-1 commented 3 years ago

Question about pandera

Through aliases and regex it is possible to validate an arbitrary number of column names. The column names are strings. In the example below, column names representing non-negative integers are valid.

valid_names = "[1-9]\\d*|0"

class Schema(pa.SchemaModel):
    columns: pa.Field(
        alias=valid_names, regex=True
    )

Is there a similar check if the column names are integers? I have a data frame with an arbitrary number of columns whose names are integers and I want to implement a check that validates the column names.
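
For concreteness, here is the kind of frame in question (the values are placeholders); note the integer dtype of the column index:

import pandas as pd

# the column names are ints, not strings that look like ints
df = pd.DataFrame({1: [0.1], 2: [0.2], 3: [0.3]})
print(df.columns.dtype)
#> int64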

jeffzi commented 3 years ago

You can use the strict argument to ensure that all and only the columns defined in the schema are present in the dataframe:

import pandas as pd
import pandera as pa

df = pd.DataFrame({"1": [0], "foo": ["a"]})

schema = pa.DataFrameSchema(
    columns={r"\d+": pa.Column(regex=True)},
    strict=True,
)
schema.validate(df, lazy=True)
#> Traceback (most recent call last):
#> ...
#> SchemaErrors: A total of 1 schema errors were found.
#> Error Counts
#> - column_not_in_schema: 1
#> Schema Error Summary
#>                                         failure_cases  n_failure_cases
#> schema_context  column check                                          
#> DataFrameSchema <NA>   column_in_schema         [foo]                1
#> Usage Tip
#> Directly inspect all errors by catching the exception:
#> ```
#> try:
#>     schema.validate(dataframe, lazy=True)
#> except SchemaErrors as err:
#>     err.failure_cases  # dataframe of schema errors
#>     err.data  # invalid dataframe
#> ```

Unfortunately, the SchemaModel API does not support the omission of data types. I've just opened an issue about it and will work on its resolution (#522).

If all your columns have the same type (e.g. int), this will work with the current version:

import pandera as pa

class Schema(pa.SchemaModel):
    columns: pa.typing.Series[int] = pa.Field(alias=r"\d+", regex=True)

    class Config:
        strict = True
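
For reference, a quick usage sketch of the model above (assuming the column values are ints):

import pandas as pd

df = pd.DataFrame({"1": [0], "2": [1]})
Schema.validate(df)  # passes

# "foo" doesn't match \d+, so strict = True rejects it
Schema.validate(pd.DataFrame({"1": [0], "foo": ["a"]}))  # raises SchemaError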

SebBanDev commented 3 years ago

@jeffzi I think the question is about a case where the column names are of type int rather than strings that represent integers:

import pandas as pd
import pandera as pa

df = pd.DataFrame({1: [0]})

schema = pa.DataFrameSchema(
    columns={r"\d+": pa.Column(regex=True)},
    strict=True,
)
schema.validate(df, lazy=True)

# Traceback (most recent call last):
# ...
#     schema.validate(df, lazy=True)
#   File "...\pandera\schemas.py", line 465, in validate
#     col_schema.get_regex_columns(check_obj.columns)
#   File "...\pandera\schema_components.py", line 248, in get_regex_columns
#     pd.Index(columns.str.match(self.name))
#   File "...\pandas\core\accessor.py", line 187, in __get__
#     accessor_obj = self._accessor(obj)
#   File "...\pandas\core\strings.py", line 2041, in __init__
#     self._inferred_dtype = self._validate(data)
#   File "...\pandas\core\strings.py", line 2098, in _validate
#     raise AttributeError("Can only use .str accessor with string values!")
# AttributeError: Can only use .str accessor with string values!

jeffzi commented 3 years ago

@SebbanSms In that case, the problem is that regex=True implicitly requires the DataFrame's column "index" to be of type str.

I think it would be possible to cast to type str here. @cosmicBboy Do you anticipate side effects?
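
To illustrate the idea (not pandera's internal code): casting the column index to str makes the .str accessor usable again, and boolean indexing keeps the original labels:

import pandas as pd

df = pd.DataFrame({1: [0], "2": [0]})
matched = df.columns[df.columns.astype(str).str.match(r"\d+")]
print(list(matched))  # original name types are preserved
#> [1, '2']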

SebBanDev commented 3 years ago

@jeffzi how would casting to string work only for the validation using the @check_types decorator?

Also, if there are already string integers, e.g. df = pd.DataFrame({1: [0], '2': [0]}), validation should still raise an exception, in this case for the column name '2' but not for 1. Context: the downstream code working with the checked DataFrame expects the numbered column names to be of type int.

SebBanDev commented 3 years ago

So far we came up with this workaround to also perform some checks on the integer columns:

import pandas as pd
import pandera as pa
from pandera import check_types
from pandera.typing import DataFrame, Series

class Test(pa.SchemaModel):
    other_column: Series[str] = pa.Field(nullable=True)

    @pa.dataframe_check
    def some_columns_are_integers(cls, df: pd.DataFrame) -> bool:
        # to check if values are correct
        schema = pa.SeriesSchema(pa.Int, checks=[pa.Check.greater_than_or_equal_to(0)])

        # to check if column names are correct
        other_column_names = list(cls.__fields__.keys())
        for col in df.columns:
            if col not in other_column_names:
                assert isinstance(col, int), \
                    f'Found unexpected column name: {col}\n' \
                    f'Apart from expected other columns: {other_column_names}\n' \
                    f'only columns with integers are allowed in this DataFrame.'
                schema.validate(df[col])
        return True

    class Config:
        strict = False

Some tests:

@check_types
def test_df_no_error() -> DataFrame[Test]:
    return pd.DataFrame(
        {
            'other_column': ['unimportant', 'unimportant', 'unimportant'],
            1: [1, 2, 3],
            2: [1, 2, 3],
            3: [1, 2, 3],
            4: [1, 2, 3],
            5: [1, 2, 3]
        }
    )

@check_types
def test_df_str_cols() -> DataFrame[Test]:
    return pd.DataFrame(
        {
            'other_column': ['unimportant', 'unimportant', 'unimportant'],
            1: [1, 2, 3],
            2: [1, 2, 3],
            3: [1, 2, 3],
            4: [1, 2, 3],
            '5': [1, 2, 3]
        }
    )

@check_types
def test_df_negative_values() -> DataFrame[Test]:
    return pd.DataFrame(
        {
            'other_column': ['unimportant', 'unimportant', 'unimportant'],
            1: [-1, -2, -3],
            2: [1, 2, 3],
            3: [1, 2, 3],
            4: [1, 2, 3],
            5: [1, 2, 3]
        }
    )

test_df_no_error()
print('no error until here')
try:
    test_df_str_cols()
except Exception as e:
    print(e)
try:
    test_df_negative_values()
except Exception as e:
    print(e)

results:

no error until here
Found unexpected column name: 5
Apart from expected other columns: ['other_column']
only columns with integers are allowed in this DataFrame.
error in check_types decorator of function 'test_df_negative_values': <Schema SeriesSchema(name=None, type=int)> failed element-wise validator 0:
<Check greater_than_or_equal_to: greater_than_or_equal_to(0)>
failure cases:
   index  failure_case
0      0            -1
1      1            -2
2      2            -3

cosmicBboy commented 3 years ago

hey @Christoph-1 @SebbanSms, this is an interesting use case that pandera currently doesn't support very elegantly; the workaround you posted is the direction I was about to suggest!

Non-string column names do seem like an important thing to validate as part of a schema. Currently pandera assumes (i) for regex=True, that all column names are strings, and (ii) that non-string column names are known ahead of time.

What you are asking for is meta-validation of dataframe metadata (the types and allowable values of column name keys)... indeed the alias and regex options are sort of going in this direction.

I'd like to figure out an ergonomic way of doing this... here's a proposal off the top of my head:

Specify a type as column name

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        # column name key is a type, not a literal
        int: pa.Column(
            int,
            checks=[
                pa.Check.gt(0),  # check that values in the column are positive
                pa.Check(lambda s: s.name > 0)  # check that the column key itself is positive
            ]
        ),
        "other_column": pa.Column(str),
    },
    strict=True
)

df = pd.DataFrame({
    1: [0],  
    '2': [0],
    "other_column": ["foo"],
})

# SchemaModel version
class Schema(pa.SchemaModel):
    # alias argument is a type, not a literal
    integer_columns: pa.typing.Series[int] = pa.Field(alias=int, gte=0)
    other_column: pa.typing.Series[str]

    @pa.check(int)
    def check_integer_column_keys(cls, series):
        return series.name > 0

    class Config:
        strict = True

This solution would match all integer column keys and apply the Column/Field checks to those columns. The "2" column name will be caught by the strict=True flag.

SebBanDev commented 3 years ago

@cosmicBboy looks like a good solution. You would need to change the implementation of the alias arg to accept types as input for this to work, right?

I'd also propose preparing for other pivot tables (in matrix format) by allowing a list of column names.

E.g. if we have a DataFrame that has some price columns for products over time:

date        mp3_player  smartphone  computer  ...
2021-01-01          25         190       450  ...
2021-01-02          30         185       500  ...
...                ...         ...       ...  ...

it would be helpful to have a list of column names that share the same datatype and checks, instead of blowing up the SchemaModel with repeated information:

product_list = ["mp3_player", "smartphone", "computer", ...]
...
product_price_matrix: pa.typing.Series[int] = Field(alias=product_list, ge=0, ...)

Or in the integer use case, it would be cool to be able to further define the range of integers to be in column names. For example:

integer_list = range(1, 51)
specific_integer_columns: pa.typing.Series[int] = Field(alias=integer_list, gte=0)

cosmicBboy commented 3 years ago

btw, if you know the column keys beforehand you can do this with DataFrameSchema:

pa.DataFrameSchema({key: pa.Column(...) for key in range(1, 51)})

The code is more concise but the problem remains that the schema object is sort of over-specified with duplicated column objects.

Thinking about this problem more, I feel like piggy-backing on alias is a little bit of a hack... I think we can boil the problem down to pattern matching (sort of like in the structural pattern matching sense): we want to be able to match a Column/Index/Field specification to some key that we either:

  1. can enumerate explicitly in a list or iterable.
  2. can match by column name data types.

The regex option sort of does this, but only with strings. So here's an updated proposal:

Introduce matches option to match column names to an iterable or type

from typing import Union

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "integer_columns": pa.Column(..., matches=int),  # match a single type
        "number_columns": pa.Column(..., matches=Union[int, float]),  # match multiple types
        "product_column": pa.Column(..., matches="foo"),  # match a single literal, can be used in combination with regex=True
        "product_columns": pa.Column(..., matches=["foo", "bar", "baz"]),  # match multiple literals
    },
    strict=True
)

# SchemaModel version
class Schema(pa.SchemaModel):
    # the matches argument is a type, a literal, or a list of either
    integer_columns: pa.typing.Series[int] = pa.Field(matches=int)
    number_columns: pa.typing.Series[int] = pa.Field(matches=Union[int, float])
    product_column: pa.typing.Series[str] = pa.Field(matches="foo")
    product_columns: pa.typing.Series[str] = pa.Field(matches=["foo", "bar", "baz"])

In SchemaModel, specifying both the matches and alias option should raise a SchemaInitError.

In summary, these are the various execution paths:

  1. if matches not specified, use column name key
  2. if a str literal is provided e.g. matches="foo", use this to match column name (same behavior as alias)
  3. if a str literal is provided and regex=True, e.g. matches="\d+", expands column names using regex
  4. if a List[literal] is provided, match column names to this set of keys
  5. if a type is provided, match column names of that type
  6. if a List[type] is provided, match column names of those types.
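
A rough sketch of what that dispatch could look like (a hypothetical helper for illustration, not an actual pandera API):

import re
from typing import Union, get_args, get_origin

def resolve_matches(columns, matches=None, regex=False):
    if matches is None:
        return list(columns)  # 1. no matches: use the column name key as-is
    if isinstance(matches, str):
        if regex:  # 3. str + regex=True: expand column names with the pattern
            return [c for c in columns if isinstance(c, str) and re.match(matches, c)]
        return [c for c in columns if c == matches]  # 2. plain str literal
    if isinstance(matches, type):
        return [c for c in columns if isinstance(c, matches)]  # 5. single type
    if get_origin(matches) is Union:  # Union[int, float] also selects by type
        return [c for c in columns if isinstance(c, get_args(matches))]
    if isinstance(matches, list):
        if all(isinstance(m, type) for m in matches):  # 6. list of types
            return [c for c in columns if isinstance(c, tuple(matches))]
        return [c for c in columns if c in matches]  # 4. list of literals
    raise TypeError(f"unsupported matches argument: {matches!r}")

For example, resolve_matches([1, "2", "foo"], matches=int) would return [1].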

@jeffzi thoughts?

jeffzi commented 3 years ago

Correct me if I'm wrong, but it sounds like there are 2 distinct problems:

  1. Selecting column names based on the type of the name: only string column names are allowed at the moment.

it would also be helpful to have a list of column names that have the same datatype...

  2. Selecting columns based on their dtype (content), in order to apply specific checks to integers for example.

Regarding 1: What's the use case for a mixed-type column index? I'm guessing that situation arises when pivoting the dataframe. Pandas does not handle that too well:

from datetime import date
import pandas as pd

df = pd.DataFrame({1: [0], "2": [0], date(2021, 1, 1): [0]})
print(df.columns.dtype)  # mixed type falls back to object
#> object
print(df.columns.str.lower())  # str accessor has unexpected result
#> Index([nan, '2', nan], dtype='object')
print(df.columns.astype(str).str.lower())  # works as expected
#> Index(['1', '2', '2021-01-01'], dtype='object')

@cosmicBboy The matches argument would apply to column name types, right? I'm wondering if it's worth adding complexity to the API. I was suggesting that pandera internally casts column names to string (see the last line of code above) and lets users rely on regex to select the column names. They'd need to give a regex like \d+ to select numbers, which should work as far as I know.

For the case mentioned by @SebbanSms :

df = pd.DataFrame({1: [0], '2': [0]}) the DataFrame should still raise an exception. In this case only for the column name '2' but not of the 1. Context: The code following working with the checked DataFrame expects the numbered column names to be of type int.

we want to select column names that are a number (regex solution works for that part) and also select columns that are of type integer.

Actually, some time ago I built a library to select pandas columns (inspired by R's dplyr). It's probably overkill for pandera but we could have Selector objects that can be combined with boolean operations:

import pandera as pa
from pandas_select import HasDtype, Match

# select columns whose names are a number and contain integers
pa.DataFrameSchema(
    columns={
        (HasDtype(int) & Match(r"\d+")): pa.Column(checks=...),  # could create HasColumnNameType to replace Match
        # selectors are optional
        "foo": pa.Column(...),
    },
)

Selectors could be used for other schema operations: update_columns, etc. That way we can have a zoo of selectors instead of editing the schema/column api as new cases turn up.
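
To make the idea concrete, here's a minimal sketch of such combinable selectors (my own illustration, not pandas_select's actual implementation):

import re
import pandas as pd

class Selector:
    # base class: a selector returns a subset of a DataFrame's columns
    def select(self, df: pd.DataFrame) -> list:
        raise NotImplementedError

    def __and__(self, other: "Selector") -> "Selector":
        return And(self, other)

class And(Selector):
    # boolean combination: keep columns selected by both operands
    def __init__(self, left: Selector, right: Selector):
        self.left, self.right = left, right

    def select(self, df):
        right = set(self.right.select(df))
        return [c for c in self.left.select(df) if c in right]

class HasDtype(Selector):
    # select columns by the dtype of their content
    def __init__(self, dtype):
        self.dtype = dtype

    def select(self, df):
        return list(df.select_dtypes(include=self.dtype).columns)

class Match(Selector):
    # select columns by regex on their name, cast to str so that
    # integer column names can be matched too
    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def select(self, df):
        return [c for c in df.columns if self.pattern.match(str(c))]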

SebBanDev commented 3 years ago

it would also be helpful to have a list of column names that have the same datatype...

Sorry, my wording was not very precise there. I actually meant if columns share the same Field, so also the same checks and so forth.

Coming back to the price per product and time example: I may want to check that all prices are greater than zero, have no extreme fluctuations on a daily basis, and so forth. So I rather meant columns that carry the same type of data (not all columns having the same dtype). All price-per-product columns in the given example would probably share some basic checks that should be repeated for them. I think we still have to define which columns are price-per-product columns separately, e.g. by a list of product names.

So selecting columns based on their dtype (content) would not be needed for this example.

jeffzi commented 3 years ago

You can already select a list of column names with a regex:

import pandas as pd
import pandera as pa

product_list = ["mp3_player", "smartphone", "computer"]
products = r"|".join(product_list)

schema = pa.DataFrameSchema({products: pa.Column(checks=[pa.Check.ge(0)], regex=True)})
schema.validate(pd.DataFrame({"smartphone": [-1]}))
#> Traceback (most recent call last):
#>  ...
#> SchemaError: <Schema Column(name=smartphone, type=None)> failed element-wise validator 0:
#> <Check greater_than_or_equal_to: greater_than_or_equal_to(0)>
#> failure cases:
#>    index  failure_case
#> 0      0            -1

I forgot to mention that a drawback of my Selector proposal is that we'd need an argument pa.Field(selectors=...), similar to @cosmicBboy's matches. In my opinion, it's not worth introducing matches if it just enables the seemingly uncommon case of mixed-type column names, unless that case cannot be handled by the proposed internal str casting + regex.

cosmicBboy commented 3 years ago

okay, now I'm confused... @SebbanSms @Christoph-1 can you please confirm these use cases?

From my understanding, you want to:

  1. select columns based on the column name type, NOT the column content data type.
  2. make sure those selected column names are some type other than string

So I rather meant columns that carry the same type of data (not all columns having the same dtype)

Can you clarify this @SebbanSms? I can't tell what the contrast is.

I was suggesting that pandera internally casts column names to string (see last line of code above) and let users rely on regex to select the column names.

I think this would be the path of least resistance to support (1). Would need to update the code here and here to convert the column names to strings (making sure the original column name types are preserved).
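
A minimal sketch of that change (illustrative only, not the actual pandera code):

import pandas as pd

def get_matching_columns(columns: pd.Index, pattern: str) -> pd.Index:
    # match on the string-cast names but return the original labels,
    # so the column name types stay untouched
    mask = columns.astype(str).str.match(pattern)
    return columns[mask]

print(get_matching_columns(pd.Index([1, "2", "foo"]), r"\d+"))
#> Index([1, '2'], dtype='object')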

For (2), a custom Check would cover that:

schema = pa.DataFrameSchema(
    columns={
        "\d+": pa.Column(
            int,
            # check that the column key is a positive int
            checks=pa.Check(lambda s: isinstance(s.name, int) and s.name > 0),
            regex=True,
        ),
        "other_column": pa.Column(str),
    },
    strict=True
)

As @jeffzi pointed out, this solution covers any case where a finite number of column names (of any type that can be cast into str) are known beforehand:

import pandas as pd
import pandera as pa

product_list = ["mp3_player", "smartphone", "computer"]
products = r"|".join(product_list)
integers = r"|".join(range(1, 51))

schema = pa.DataFrameSchema({
    products: pa.Column(checks=[pa.Check.ge(0)], regex=True),
    integers: pa.Column(..., regex=True),
})

SebBanDev commented 3 years ago

@cosmicBboy In our specific case, making sure that all columns with a positive integer as column name share the same checks would be enough.


same type of data

Sorry once again for the imprecise language. With this I meant columns that should have the same checks (e.g. checks for all price-per-product columns).

Thanks for the example on how to do this with a list of string columns in combination with regex, I'll note it down for the future.

cosmicBboy commented 3 years ago

cool, so the next step for this issue is:

to update the code here and here to convert the column names to strings (making sure the original column name types are preserved).