SchemaError on index name for single-index dataframe when using SchemaModel

kristomi commented 3 years ago

Thanks for the awesome SchemaModel interface! That really improves readability and usability.

Describe the bug I am not able to validate a pandas DataFrame with a single index column when that index is named. The code on line 304 in pandera/model.py seems to explicitly force the schema to have a None-named index when the index only has one column. How can I get around this?

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[] (optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandera as pa
import pandas as pd

df_invalid = pd.DataFrame(
    {"year": [2019, 2020], "value_1": [1, 2], "value_2": [2, 3]}
).set_index("year")
df_valid = pd.DataFrame({"value_1": [1, 2], "value_2": [2, 3]}, index=[2019, 2020])

class Schema(pa.SchemaModel):
    year: pa.typing.Index[int]
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

Schema.validate(df_valid)
Schema.validate(df_invalid)

Expected behavior

I expect this to read the index name for the column, and accept the index. Alternatively that I can specify in the Config class that I want the schema to accept named indices.

Desktop (please complete the following information):

OS: MacOS
Browser Chrome
Version 86

cosmicBboy commented 3 years ago

thanks for submitting this bug @kristomi, I think the rationale for this behavior is was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?), although I think this behavior needs to be amended to support your use case.

Potential Solutions

Skip the name check for single-indexed dataframes during validation altogether.
Add Config option to accept named indices

class Schema(pa.SchemaModel):
    ...

    class Config:
        named_index: True  # default False

Don't override index name with None in https://github.com/pandera-dev/pandera/blob/master/pandera/model.py#L295 and provide a special name for un-named single indexes, for example:

class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

I'm sort of leaning toward 2 or 3... the nice thing about 2 is that it opens up support for un-named multiindex.

Thoughts @kristomi and @jeffzi?

kristomi commented 3 years ago

Thanks for the quick response.

The way I constructed the invalid dataframe is a very typical workflow for me, where dataframes are passed around and indices are set and reset all the time. On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.

jeffzi commented 3 years ago

I think the rationale for this behavior is was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?)

Yes, that's right.

I also like the fact that 2. addresses unnamed multiindex and aligns the model api with the standard api. My issue with 3. is that it increases the complexity for new-comers. It would also be confusing in inherited models:

class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

class SubSchema(Schema):
    # With current implementation, we would create a MultiIndex !
    #  What if we want to name the unnamed __index__?
    year: pa.typing.Index[int]

On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.

If 3. is selected, I agree it would be reasonable, and perhaps more natural, to validate index name by default. After all, you do name the index.

Edit: I meant "If 2. is selected".

kristomi commented 3 years ago

You guys have way more insight than I do, and I clearly see the problem with inheritance now that you mention it. Perhaps nr 2 is a better choice, then. Anyway, I'm not in a position to see all the implications of solving this one way or the other.

jeffzi commented 3 years ago

@kristomi np! It's already a great help to report bugs and feedback 🙂

@cosmicBboy Once a decision has been reached, I'd be happy to take care of the changes. I saw you already have a lot on your plate.

cosmicBboy commented 3 years ago

thanks for the feedback! After thinking about it for a little bit, I'd like to go for solution # 2, with a slight addition:

Introduce a `verify_name` kwarg to the `SeriesSchemaBase` class constructor and `verify_names` kwarg in `MultiIndex` constructor

verify_name(s) should be an instance property.
If True, verify the name of the Series/Index/MultiIndex, otherwise don't.
This would be True by default in SeriesSchemaBase and SeriesSchema, False by default in the Index, and False (I think?) by default in MultiIndex
The Field field should also add this new kwarg.
The SchemaModel Config should have multi_index__verify_names kwarg
For MultiIndex, if verify_names==False, modify the checked dataframe here so that the column names are 0-indexed integers. (Note: MultiIndex inherits from DataFrameSchema and uses its underlying validation logic by casting the pd.MultiIndex into a DataFrame... a questionable design decision perhaps, but a vestigial part of the early days of this library when I was lazy and didn't want to re-implement validation for multiindexes 🙃)

My reasoning for this is:

I agree with @jeffzi that I'd prefer not to introduce a special concept for un-named indexes, as it degrades usability
The MultiIndex object already handles un-named MultiIndexes provided by the user in the validated object by falling back to 0-indexed integer column names. verify_names just adds granular control on whether the user wants to explicitly check these names on the MultiIindex.validate call.
This keeps the API tight in the sense that users can control the name-checking behavior of both single indexes and multiindexes, while providing reasonable defaults... typically, I don't really care whether a single index has the correct name, but if I do I just specify pa.Index(verify_name=True) or Field(verify_name=True) in the class-based API. This setup should also support the df_invalid in your example code @kristomi and gracefully handle the inheritance case:

class Schema(pa.SchemaModel):
    year: pa.typing.Index[int]  # by default, pa.Index(verify_name=False) so pandera won't check the "year" name here
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

class SubSchema(Schema):
    month: pa.typing.Index[int]  # by default, pa.MultiIndex(verify_names=True) so pandera will construct multiindex here and check index names

cosmicBboy commented 3 years ago

Once a decision has been reached, I'd be happy to take care of the changes.

Thanks @jeffzi! Let me know what you think about the above proposal and if you have any questions/concerns about it

jeffzi commented 3 years ago

I would suggest check_name instead of verify_name to keep the same vocabulary used elsewhere in pandera.

Name validation is mandatory for Series because we need the name to get the column to validate from the DataFrame: https://github.com/pandera-dev/pandera/blob/31fe07b30b267527ecea7f6b1435f49b6b963abf/pandera/schemas.py#L937-L941

That would not be an issue if pandera supported columns order. It is something that I actually wanted to suggest for the machine learning use case. Many ML libraries ignore the column names and rely on the order (possibly casting to a numpy array).

False by default in the Index, and False (I think?) by default in MultiIndex

Your example says it should be True by default :question:
Personally I only give a name to a single index when I set it explicitly, which is unnecessary most of the time. I always give a name to a MultiIndex though. True by default for MultiIndex is also closer to pandera's standard API.

I agree with your other points.

cosmicBboy commented 3 years ago

I would suggest check_name instead of verify_name

sounds good 👍

Your example says it should be True by default

woops, yes I meant True by default :)

unionai-oss / pandera