Closed kristomi closed 3 years ago
thanks for submitting this bug @kristomi, I think the rationale for this behavior was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?), although I think this behavior needs to be amended to support your use case.
Add a `Config` option to accept named indices:

    class Schema(pa.SchemaModel):
        ...
        class Config:
            named_index = True  # default False
`None` in https://github.com/pandera-dev/pandera/blob/master/pandera/model.py#L295 and provide a special name for un-named single indexes, for example:

    class Schema(pa.SchemaModel):
        __index__: pa.typing.Index[int]  # un-named index
        value_1: pa.typing.Series[int]
        value_2: pa.typing.Series[int]
I'm sort of leaning toward 2 or 3... the nice thing about 2 is that it opens up support for un-named multiindex.
Thoughts @kristomi and @jeffzi?
Thanks for the quick response.
The way I constructed the invalid dataframe is a very typical workflow for me, where dataframes are passed around and indices are set and reset all the time. On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.
> I think the rationale for this behavior was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?)
Yes, that's right.
I also like the fact that 2. addresses unnamed multiindex and aligns the model API with the standard API. My issue with 3. is that it increases the complexity for newcomers. It would also be confusing in inherited models:
    class Schema(pa.SchemaModel):
        __index__: pa.typing.Index[int]  # un-named index
        value_1: pa.typing.Series[int]
        value_2: pa.typing.Series[int]

    class SubSchema(Schema):
        # With current implementation, we would create a MultiIndex !
        # What if we want to name the unnamed __index__?
        year: pa.typing.Index[int]
> On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.
If 3. is selected, I agree it would be reasonable, and perhaps more natural, to validate index name by default. After all, you do name the index.
Edit: I meant "If 2. is selected".
You guys have way more insight than I do, and I clearly see the problem with inheritance now that you mention it. Perhaps nr 2 is a better choice, then. Anyway, I'm not in a position to see all the implications of solving this one way or the other.
@kristomi np! It's already a great help to report bugs and feedback.
@cosmicBboy Once a decision has been reached, I'd be happy to take care of the changes. I saw you already have a lot on your plate.
thanks for the feedback! After thinking about it for a little bit, I'd like to go for solution # 2, with a slight addition:
- Add a `verify_name` kwarg to the `SeriesSchemaBase` class constructor and a `verify_names` kwarg in the `MultiIndex` constructor.
- `verify_name(s)` should be an instance property.
- True by default in `SeriesSchemaBase` and `SeriesSchema`, False by default in the `Index`, and False (I think?) by default in `MultiIndex`.
- The `Field` field should also add this new kwarg.
- `multi_index__verify_names` kwarg.
- In `MultiIndex`, if `verify_names==False`, modify the checked dataframe here so that the column names are 0-indexed integers. (Note: `MultiIndex` inherits from `DataFrameSchema` and uses its underlying validation logic by casting the `pd.MultiIndex` into a `DataFrame`... a questionable design decision perhaps, but a vestigial part of the early days of this library when I was lazy and didn't want to re-implement validation for multiindexes.)

My reasoning for this is:

- `verify_names` just adds granular control on whether the user wants to explicitly check these names on the `MultiIndex.validate` call.
- `pa.Index(verify_name=True)` or `Field(verify_name=True)` in the class-based API.

This setup should also support the `df_invalid` in your example code @kristomi and gracefully handle the inheritance case:

    class Schema(pa.SchemaModel):
        year: pa.typing.Index[int]  # by default, pa.Index(verify_name=False) so pandera won't check the "year" name here
        value_1: pa.typing.Series[int]
        value_2: pa.typing.Series[int]

    class SubSchema(Schema):
        month: pa.typing.Index[int]  # by default, pa.MultiIndex(verify_names=True) so pandera will construct multiindex here and check index names
> Once a decision has been reached, I'd be happy to take care of the changes.
Thanks @jeffzi! Let me know what you think about the above proposal and if you have any questions/concerns about it
I would suggest `check_name` instead of `verify_name` to keep the same vocabulary used elsewhere in pandera.
Name validation is mandatory for Series because we need the name to get the column to validate from the DataFrame: https://github.com/pandera-dev/pandera/blob/31fe07b30b267527ecea7f6b1435f49b6b963abf/pandera/schemas.py#L937-L941
That would not be an issue if pandera supported validating columns by order. It is something that I actually wanted to suggest for the machine learning use case. Many ML libraries ignore the column names and rely on the order (possibly casting to a numpy array).
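To make the order-based idea concrete, here is a minimal sketch in plain Python in which columns are checked by position and their names are ignored. The `validate_by_position` helper and its arguments are hypothetical and chosen for illustration; this is not an existing pandera feature.

```python
from typing import Any, Dict, List, Sequence


def validate_by_position(
    table: Dict[str, List[Any]], expected_types: Sequence[type]
) -> bool:
    """Check each column against the type expected at its position.

    Column names are ignored; only the order of the columns matters,
    which mirrors how an array-based ML library would consume the data.
    """
    columns = list(table.values())  # dicts preserve insertion order
    if len(columns) != len(expected_types):
        return False
    return all(
        isinstance(value, expected)
        for column, expected in zip(columns, expected_types)
        for value in column
    )


renamed = {"a": [1, 2], "b": [0.5, 1.5]}    # names differ from any schema
reordered = {"b": [0.5, 1.5], "a": [1, 2]}  # same data, columns swapped

ok_renamed = validate_by_position(renamed, [int, float])
ok_reordered = validate_by_position(reordered, [int, float])
```

With positional matching, renaming a column does not affect the result, while swapping two columns of different types does.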
> False by default in the Index, and False (I think?) by default in MultiIndex

Your example says it should be True by default?
Personally I only give a name to a single index when I set it explicitly, which is unnecessary most of the time. I always give a name to a MultiIndex though. True by default for MultiIndex is also closer to pandera's standard API.
I agree with your other points.
> I would suggest `check_name` instead of `verify_name`

sounds good
> Your example says it should be True by default
woops, yes I meant True by default :)
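The defaults agreed on in this exchange can be summarized in a small sketch. It assumes a tri-state `check_name` argument where `None` means "use the default"; that convention and the helper names are hypothetical, and this is plain Python rather than pandera code.

```python
from typing import Optional, Sequence


def resolve_check_name(check_name: Optional[bool], n_levels: int) -> bool:
    """Apply the defaults from the discussion when check_name is None.

    Single index: names are not checked. MultiIndex: names are checked.
    """
    if check_name is not None:
        return check_name
    return n_levels > 1


def index_names_match(
    expected: Sequence[Optional[str]],
    actual: Sequence[Optional[str]],
    check_name: Optional[bool] = None,
) -> bool:
    """Compare index level names, or only the level count if checking is off."""
    if len(expected) != len(actual):
        return False
    if not resolve_check_name(check_name, len(expected)):
        return True  # levels are matched by position only
    return list(expected) == list(actual)
```

For example, `index_names_match(["year"], [None])` passes because a single index is not name-checked by default, while `index_names_match(["year", "month"], ["year", None])` fails because a MultiIndex is.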
Thanks for the awesome `SchemaModel` interface! That really improves readability and usability.

Describe the bug
I am not able to validate a pandas DataFrame with a single index column when that index is named. The code on line 304 in `pandera/model.py` seems to explicitly force the schema to have a `None`-named index when the index only has one column. How can I get around this?

Code Sample, a copy-pastable example

Expected behavior
I expect this to read the index name for the column, and accept the index. Alternatively, that I can specify in the `Config` class that I want the schema to accept named indices.