unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Name of single index is None in SchemaModel #867

Closed the-matt-morris closed 2 years ago

the-matt-morris commented 2 years ago

Love the work this library is enabling!

Describe the bug

When a SchemaModel declares a single index, the schema returned by to_schema() reports the index name as None.

Code Sample, a copy-pastable example

from pandera import SchemaModel
from pandera.typing import Index, Series, String

# Contrived schema with single index, 1 column
class MySchema(SchemaModel):

    # Index
    foo: Index[String]

    # Column(s)
    bar: Series[String]

# Attempt to see name of index
MySchema.to_schema().index
<Schema Index(name=None, type=DataType(str))>

The names are kept when using a multi-index, but not when a single index is specified, as above.

Expected behavior

The name attribute of the index should be foo in the above example.

Workaround

There's gotta be a better way to do this, but here's my hacky way to get this to work for now:

from pandera import SchemaModel as PanderaSchemaModel
from pandera.typing import String, Index, Series

class SchemaModel(PanderaSchemaModel):

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # Populate cls.__schema__
        cls.to_schema()

        if (
            (index := getattr(cls.__schema__, "index", None)) and
            (index.name is None)
        ):
            # Find name of index, assuming it is the only name from the list
            # of fields that is not present in columns
            for field in cls.__fields__:
                if field not in cls.__schema__.columns:
                    cls.__schema__.index._name = field
                    break

# Contrived schema
class MySchema(SchemaModel):

    # Index
    foo: Index[String]

    # Column(s)
    bar: Series[String]

# Attempt to see name of index
MySchema.to_schema().index
<Schema Index(name=foo, type=DataType(str))>

hoffch commented 2 years ago

I can verify this issue with pandera 0.11.0. Pretty annoying. Besides that: awesome package!

cosmicBboy commented 2 years ago

This needs to be documented better, but you need to supply the check_name=True argument to pa.Field in order to preserve single-index schema metadata when converting with to_schema.

See example here

The API reference has a more complete description: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field

check_name (Optional[bool]) – Whether to check the name of the column/index during validation. None is the default behavior, which translates to True for columns and multi-index, and to False for a single index.
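To make the rule in that description concrete, here is a minimal pure-Python sketch of how the None default resolves. The helper name resolve_check_name and its is_single_index parameter are hypothetical and only illustrate the documented behavior; this is not pandera's actual implementation.

```python
from typing import Optional


def resolve_check_name(check_name: Optional[bool], is_single_index: bool) -> bool:
    """Hypothetical helper: effective check_name for a field.

    An explicit True/False always wins. None falls back to the documented
    default: False for a single index, True for columns and multi-index levels.
    """
    if check_name is not None:
        return check_name
    return not is_single_index


# Default (None) on a single index: the name is not checked or kept.
single_default = resolve_check_name(None, is_single_index=True)
# Default (None) on a column or multi-index level: the name is checked.
other_default = resolve_check_name(None, is_single_index=False)
# Explicit True on a single index overrides the default.
single_explicit = resolve_check_name(True, is_single_index=True)
```

Under these assumptions, only the first case drops the name, which matches the name=None output in the issue description.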

This is the default because single-index dataframes are often unnamed, and there's no way to declare an unnamed index in a SchemaModel. That caused validation failures, since a SchemaModel with an index would try to validate the index name (e.g. foo in the issue description); see #326.

Hence check_name=None behaves differently depending on whether the schema has a single index or a multi-index.

"I can verify this issue with pandera 0.11.0. Pretty annoying."

Any chance you want to channel that energy to a PR with an example in the docs somewhere on this page @hoffch ?? 😀

the-matt-morris commented 2 years ago

@cosmicBboy, thank you for the detailed explanation on this. I won't be able to get to it right away, but I can submit a PR with an example in the docs.

hoffch commented 2 years ago

@cosmicBboy Thanks for the clarification, the rationale is pretty convincing. Unfortunately, I can't contribute a PR in the foreseeable future. Double thanks to @the-matt-morris for doing so instead of me!