unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.37k stars 310 forks source link

SchemaError on index name for single-index dataframe when using SchemaModel #323

Closed kristomi closed 3 years ago

kristomi commented 3 years ago

Thanks for the awesome SchemaModel interface! That really improves readability and usability.

Describe the bug I am not able to validate a pandas DataFrame with a single index column when that index is named. The code on line 304 in pandera/model.py seems to explicitly force the schema to have a None-named index when the index only has one column. How can I get around this?

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandera as pa
import pandas as pd

df_invalid = pd.DataFrame(
    {"year": [2019, 2020], "value_1": [1, 2], "value_2": [2, 3]}
).set_index("year")
df_valid = pd.DataFrame({"value_1": [1, 2], "value_2": [2, 3]}, index=[2019, 2020])

class Schema(pa.SchemaModel):
    year: pa.typing.Index[int]
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

Schema.validate(df_valid)
Schema.validate(df_invalid)

Expected behavior

I expect this to read the index name for the column, and accept the index. Alternatively that I can specify in the Config class that I want the schema to accept named indices.

Desktop (please complete the following information):

cosmicBboy commented 3 years ago

thanks for submitting this bug @kristomi, I think the rationale for this behavior is was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?), although I think this behavior needs to be amended to support your use case.

Potential Solutions

  1. Skip the name check for single-indexed dataframes during validation altogether.
  2. Add Config option to accept named indices
class Schema(pa.SchemaModel):
    ...

    class Config:
        named_index: True  # default False
  1. Don't override index name with None in https://github.com/pandera-dev/pandera/blob/master/pandera/model.py#L295 and provide a special name for un-named single indexes, for example:
class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

I'm sort of leaning toward 2 or 3... the nice thing about 2 is that it opens up support for un-named multiindex.

Thoughts @kristomi and @jeffzi?

kristomi commented 3 years ago

Thanks for the quick response.

The way I constructed the invalid dataframe is a very typical workflow for me, where dataframes are passed around and indices are set and reset all the time. On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.

jeffzi commented 3 years ago

I think the rationale for this behavior is was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?)

Yes, that's right.

I also like the fact that 2. addresses unnamed multiindex and aligns the model api with the standard api. My issue with 3. is that it increases the complexity for new-comers. It would also be confusing in inherited models:

class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

class SubSchema(Schema):
    # With current implementation, we would create a MultiIndex !
    #  What if we want to name the unnamed __index__?
    year: pa.typing.Index[int]  

On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.

If 3. is selected, I agree it would be reasonable, and perhaps more natural, to validate index name by default. After all, you do name the index.

Edit: I meant "If 2. is selected".

kristomi commented 3 years ago

You guys have way more insight than I do, and I clearly see the problem with inheritance now that you mention it. Perhaps nr 2 is a better choice, then. Anyway, I'm not in a position to see all the implications of solving this one way or the other.

jeffzi commented 3 years ago

@kristomi np! It's already a great help to report bugs and feedback πŸ™‚

@cosmicBboy Once a decision has been reached, I'd be happy to take care of the changes. I saw you already have a lot on your plate.

cosmicBboy commented 3 years ago

thanks for the feedback! After thinking about it for a little bit, I'd like to go for solution # 2, with a slight addition:

Introduce a verify_name kwarg to the SeriesSchemaBase class constructor and verify_names kwarg in MultiIndex constructor

My reasoning for this is:

class Schema(pa.SchemaModel):
    year: pa.typing.Index[int]  # by default, pa.Index(verify_name=False) so pandera won't check the "year" name here
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

class SubSchema(Schema):
    month: pa.typing.Index[int]  # by default, pa.MultiIndex(verify_names=True) so pandera will construct multiindex here and check index names
cosmicBboy commented 3 years ago

Once a decision has been reached, I'd be happy to take care of the changes.

Thanks @jeffzi! Let me know what you think about the above proposal and if you have any questions/concerns about it

jeffzi commented 3 years ago

I would suggest check_name instead of verify_name to keep the same vocabulary used elsewhere in pandera.

Name validation is mandatory for Series because we need the name to get the column to validate from the DataFrame: https://github.com/pandera-dev/pandera/blob/31fe07b30b267527ecea7f6b1435f49b6b963abf/pandera/schemas.py#L937-L941

That would not be an issue if pandera supported columns order. It is something that I actually wanted to suggest for the machine learning use case. Many ML libraries ignore the column names and rely on the order (possibly casting to a numpy array).

False by default in the Index, and False (I think?) by default in MultiIndex

Your example says it should be True by default :question:
Personally I only give a name to a single index when I set it explicitly, which is unnecessary most of the time. I always give a name to a MultiIndex though. True by default for MultiIndex is also closer to pandera's standard API.

I agree with your other points.

cosmicBboy commented 3 years ago

I would suggest check_name instead of verify_name

sounds good πŸ‘

Your example says it should be True by default

woops, yes I meant True by default :)