unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

`validate` is slow when coercing several hundred columns. #1071

Open koalp opened 1 year ago

koalp commented 1 year ago

Describe the bug

Validating a dataframe against a SchemaModel that has several hundred columns with coerce enabled takes a lot of time, even if the dataframe is already valid. The slowdown doesn’t occur when coerce is disabled.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
import numpy as np
from pandera.typing import Series

class TestCoerce(pa.SchemaModel):
    # Raw string so that \d reaches the regex engine instead of being read as a string escape
    a: Series[float] = pa.Field(alias=r"a\d+", regex=True, coerce=True)

class TestNoCoerce(pa.SchemaModel):
    a: Series[float] = pa.Field(alias=r"a\d+", regex=True, coerce=False)

def gen_df(
    value: float = 1.618,
    col_number: int = 40,
    row_number: int = int(1e6),
):

    return pd.DataFrame(
        {
            f"a{i}": np.full(row_number, value)
            for i in range(col_number)
        }
    )

df = gen_df()
TestCoerce.validate(df)

This gist contains a script that compares execution time with and without coerce: https://gist.github.com/koalp/0e70303c014712a6f7f790b5743482a3

Expected behavior

Coercion shouldn’t take so much time when the dtype is already correct. Ideally, it also wouldn’t be slow when all the columns must be converted.

Additional context

After running benchmarks, I found that the __setattr__ function from pandas (replacing a column) takes a lot of time to run (Python 3.9). If I modify pandera to call setattr only when the result from try_coercion differs from the previous column, it solves my issue, as I currently have at most one column that needs to be changed (wrong dtype). However, it isn’t a generic solution, as it doesn’t help when a lot of columns have the wrong dtype.

On discord, a modification was suggested:

I think an alternative and potentially faster solution would be to check if the dtype of obj[matched_colname] is the same as col_schema.dtype. If so, then coercion isn't necessary. If not, then apply coercion and reassign the column.
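
The suggestion above could be sketched roughly as follows, using plain pandas rather than pandera internals. This is only an illustration of the idea, not pandera's actual implementation: `coerce_mismatched_columns` and `expected_dtype` are hypothetical names, and `astype` stands in for whatever coercion pandera performs via try_coercion and col_schema.dtype.

```python
import numpy as np
import pandas as pd


def coerce_mismatched_columns(df: pd.DataFrame, expected_dtype: str) -> list:
    """Coerce only the columns whose dtype differs from expected_dtype.

    Hypothetical helper: returns the names of the columns that were
    actually cast and reassigned.
    """
    target = np.dtype(expected_dtype)
    reassigned = []
    for name in list(df.columns):
        if df[name].dtype == target:
            # dtype already matches: skip both the cast and the column reassignment
            continue
        df[name] = df[name].astype(target)
        reassigned.append(name)
    return reassigned


# 40 float64 columns, one of which is deliberately given the wrong dtype
df = pd.DataFrame({f"a{i}": np.full(1_000, 1.618) for i in range(40)})
df["a3"] = df["a3"].astype("float32")

changed = coerce_mismatched_columns(df, "float64")
print(changed)
```

In this sketch only `a3` is cast and reassigned; the other 39 columns are left untouched, which is the case the issue is about. It does not make the all-columns-wrong case any faster.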

cosmicBboy commented 1 year ago

Thanks for opening this @koalp! I think a good solution here is to check whether the type of the incoming data matches the expected type, and to coerce and re-assign only the columns that don't match.

Will circle back to this issue once https://github.com/unionai-oss/pandera/pull/913 is merged