unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Parse function of pa.DataFrameModel is called twice #1842

Open TimotejPalus opened 1 month ago

TimotejPalus commented 1 month ago

Describe the bug

Hello, it seems that the parser function is called twice for the specified column of a given pandas DataFrame. Please see the sample code and sample output below.


Code Sample, a copy-pastable example

Slightly modified example from https://pandera.readthedocs.io/en/stable/parsers.html#parsers-in-dataframemodel

import pandas as pd
import pandera as pa

data = pd.DataFrame({
    "a": [2.0, 4.0, 9.0],
    "b": [2.0, 4.0, 9.0],
    "c": [2.0, 4.0, 9.0],
})

class DFModel(pa.DataFrameModel):
    a: float
    b: float
    c: float

    @pa.parser("b")
    def negate(cls, series):
        print(series)
        return series

DFModel.validate(data)

Printed to console

```
0    2.0
1    4.0
2    9.0
Name: b, dtype: float64
0    2.0
1    4.0
2    9.0
Name: b, dtype: float64
```

Expected behavior

The console output shows that negate is run twice. I would expect the parser to be run once. I could not find an explanation for this behavior in the documentation. While searching, I found a similar issue: https://github.com/unionai-oss/pandera/issues/1707

Additional context

pandera version: '0.20.4'

Thank you very much :)

TimotejPalus commented 1 month ago

It seems the parser is run twice, but the resulting pd.DataFrame contains only the output of the first run:

code:

import pandas as pd
import pandera as pa

data = pd.DataFrame({
    "a": [2.0, 4.0, 9.0],
    "b": [2.0, 4.0, 9.0],
    "c": [2.0, 4.0, 9.0],
})

class DFModel(pa.DataFrameModel):
    a: float
    b: float
    c: float

    @pa.parser("b")
    def negate(cls, series):
        print('\n -------------',f'\nbefore parsing: {series.tolist()}', f'\nafter parsing: {(series + 1).tolist()}')
        return series + 1

data = DFModel.validate(data)
print('\n -------------',f'\nResulting "b" column in the "data" pd.DataFrame: {data["b"].tolist()}')

console:

 ------------- 
before parsing: [2.0, 4.0, 9.0] 
after parsing: [3.0, 5.0, 10.0]
 ------------- 
before parsing: [3.0, 5.0, 10.0] 
after parsing: [4.0, 6.0, 11.0]
 ------------- 
Resulting "b" column in the "data" pd.DataFrame: [3.0, 5.0, 10.0]
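
A way to check the number of calls without reading console output is to count invocations. The following is only a minimal sketch using the standard library: count_calls and add_one are hypothetical names that are not part of pandera, and a plain list stands in for the "b" column. It reproduces the values shown above by calling the function twice by hand; it does not show what pandera does internally.

```python
import functools


def count_calls(func):
    """Wrap func so that every invocation increments wrapper.calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def add_one(values):
    # Hypothetical stand-in for the parser body above (series + 1),
    # written against a plain list so no third-party package is needed.
    return [v + 1 for v in values]


first = add_one([2.0, 4.0, 9.0])   # corresponds to the first "after parsing" line
second = add_one(first)            # corresponds to the second, unexpected call

print(add_one.calls)
print(first)
print(second)
```

Because the function is not idempotent, the second call produces different values from the first, which is why the double call is visible here. Whether such a wrapper composes cleanly with @pa.parser is untested; calling a counted helper from inside the parser method would be the more cautious option.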

Girmii commented 3 weeks ago

Also mentioned in #1684