unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

`P.STRING` index not coercible with pandera[modin-ray] #679

Closed zevisert closed 2 years ago

zevisert commented 2 years ago

Describe the bug

P.STRING fails validation / coercion when using pandera[modin-ray].

Code Sample, a copy-pastable example

import modin.pandas as pd
import pandera as pa
import pandera.typing as P

class Example(pa.SchemaModel):
    strings: P.Index[P.STRING]

    class Config:
        coerce = True

@pa.decorators.check_types()
def main() -> P.DataFrame[Example]:
    return pd.DataFrame(
        index=[
            "should",
            "not",
            "throw",
            "during",
            "schema",
            "validate",
        ],
    )

if __name__ == "__main__":
    main()

Expected behavior

When using pandera[modin-ray], we should be able to coerce "object"-strings to pandas.StringDtype strings, even in an index.

Additional context

Repro repository is available here:

git clone git@github.com:zevisert/pandera-modin-index-type-repro
cd pandera-modin-index-type-repro
poetry install
poetry run repro

cosmicBboy commented 2 years ago

hi @zevisert, it looks like modin can't handle string[python] dtypes for indexes (it silently converts them to object), whereas a Series keeps the string dtype:

import modin.pandas as pd  # modin's pandas API, as in the original report

index = pd.Index(
    [
        "should",
        "not",
        "throw",
        "during",
        "schema",
        "validate",
    ],
    dtype="string[python]"
)
series = pd.Series(
    [
        "should",
        "not",
        "throw",
        "during",
        "schema",
        "validate",
    ],
    dtype="string[python]"
)
print("Index:", index)
print("Series:", series)

output:

Index: Index(['should', 'not', 'throw', 'during', 'schema', 'validate'], dtype='object')
Series: 0      should
1         not
2       throw
3      during
4      schema
5    validate
dtype: string

You can specify the str or object types in your schema, which should work just fine:

class Example(pa.SchemaModel):
    index: P.Index[str]

    class Config:
        coerce = True

cosmicBboy commented 2 years ago

It would probably make sense to raise an error in cases like these, where the pandas-like implementation (modin, koalas, etc.) silently casts types to some fallback type.
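
As a rough illustration of that idea (not pandera's actual implementation), a coercion helper could compare the dtype it asked for with the dtype it got back and raise when they differ. Everything named here is hypothetical: `SilentCastError`, `coerce_strict`, and the toy `FallbackIndex` class, which only imitates the fallback-to-object behavior shown above so the sketch runs without modin installed. Comparing dtypes by their string form is also a simplification.

```python
class SilentCastError(TypeError):
    """Hypothetical error: the backend returned a different dtype than requested."""


def coerce_strict(obj, dtype):
    """Coerce via .astype() and verify that the requested dtype was applied.

    Works on any object exposing .astype(dtype) and a .dtype attribute.
    """
    coerced = obj.astype(dtype)
    actual = str(coerced.dtype)
    if actual != str(dtype):
        raise SilentCastError(
            f"requested dtype {dtype!r}, but the backend returned {actual!r}"
        )
    return coerced


class FallbackIndex:
    """Toy stand-in for an index type that cannot hold string dtypes."""

    def __init__(self, values, dtype="object"):
        self.values = list(values)
        self.dtype = dtype

    def astype(self, dtype):
        # Imitate the silent fallback: any "string..." dtype becomes "object".
        new_dtype = "object" if str(dtype).startswith("string") else str(dtype)
        return FallbackIndex(self.values, new_dtype)


index = FallbackIndex(["should", "not", "throw"])
try:
    coerce_strict(index, "string[python]")
except SilentCastError as exc:
    print("coercion rejected:", exc)
```

With a check like this, the schema in the original report would fail with a message naming both dtypes instead of a generic validation error.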

zevisert commented 2 years ago

Interesting, I didn't notice that this was on modin. Thanks!