unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars · 305 forks

Pandera SchemaModel showing incorrect results for integer column validations #821

Closed learnnk closed 2 years ago

learnnk commented 2 years ago

Hi Team, I'm using the pandera SchemaModel to validate a file that has an integer column.

My schema model:

```python
class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True, nullable=False)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)
```

Data:

```
  orderid              orderdate  orderamount      orderstatus
0       1  2013-07-25 00:00:00.0        11599  PENDING_PAYMENT
1      -1  2013-07-25 00:00:00.0          256  PENDING_PAYMENT
2     abc  2013-07-25 00:00:00.0        12111         COMPLETE
```

Output:

```
  schema_context   column ... failure_case  index
0         Column  orderid ...          abc      2

[1 rows x 6 columns]
```

To test this column:

  1. one row contains a string value ("abc")
  2. another row contains a negative value (-1)

Output: it only ever shows the failure case for the string value and its index (point 1), and does not show the index for the negative value (point 2).

Can you please suggest/guide how to capture the failure-case index for the negative row (index 1 in the data above)?
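For reference, the row I expect to be flagged can be located with plain pandas (a sketch using the sample data above, not pandera output):

```python
import pandas as pd

orderid = pd.Series(["1", "-1", "abc"])
numeric = pd.to_numeric(orderid, errors="coerce")  # "abc" becomes NaN
negative_index = numeric.index[numeric < 0].tolist()
print(negative_index)  # [1]
```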

cosmicBboy commented 2 years ago

Hi @learnnk, can you please provide a minimally reproducible code snippet with your (i) schema and (ii) data... basically what you've done above except in an easily copy-pasteable form (and the unexpected error/output you're seeing). This will help us help you 🤝

learnnk commented 2 years ago

Thanks for your response. Here is the code snippet:

```python
import pandera as pa
import pandas as pd
from pandera.typing import Index, DataFrame, Series
from pandera import Check, Column, DataFrameSchema

class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True, nullable=False, gt=0)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

df = pd.DataFrame({
    "orderid": ["1", "-1", "abc"],
    "orderdate": ["2013-07-25", "2013-07-27", "2013-07-28"],
    "orderamount": ["1234", "156", "365"],
    "orderstatus": ["PENDING_PAYMENT", "COMPLETE", "CLOSED"],
})

try:
    Retaildata.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases, '\n')
```

The output I'm getting:

```
  schema_context   column ... failure_case  index
0         Column  orderid ...          abc      2

[1 rows x 6 columns]
```

As I said, I'm not getting a failure case for the negative value in orderid at index 1.

cosmicBboy commented 2 years ago

FYI you can do triple backticks ``` to preserve indentation:

```python
import pandera as pa
import pandas as pd
from pandera.typing import Index, DataFrame, Series
from pandera import Check, Column, DataFrameSchema

class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True, nullable=False, gt=0)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)
```

this makes it easier to copy-paste into (in this case) my editor to help debug

cosmicBboy commented 2 years ago

ah, you're seeing just the coercion error (failure to coerce "abc" to an int)

```
  schema_context   column                  check check_number failure_case  index
0         Column  orderid  coerce_dtype('int64')         None          abc      2
```

because pandera doesn't apply the gt=0 constraint check. Here's why: currently, a failure to coerce a column to the specified dtype short-circuits the rest of the property checks. The original reasoning is that if a column cannot even be coerced to the correct type, it's fair to assume we cannot apply the check constraints (which assume the column is of that type) in the first place.

This might be too strict an assumption, but I'd be open to changing this behavior if you and others find this unintuitive.

In any case, it would be worth adding an explanation of this behavior in the lazy validation docs... https://pandera.readthedocs.io/en/stable/lazy_validation.html

I'm curious: what is your specific use case, and what are the ways you're using pandera/pandas to clean/validate data? E.g. is this a programmatic data pipeline, or are you using pandera in the context of a GUI where a human updates the state of the data with multiple rounds of validation?

learnnk commented 2 years ago

Thanks for the quick update. Here is my response to your question:

> I'm curious: what is your specific use case, and what are the ways you're using pandera/pandas to clean/validate data? E.g. is this a programmatic data pipeline, or are you using pandera in the context of a GUI where a human updates the state of the data with multiple rounds of validation?

[learnnk] We are using pandera for data-quality validation. We built an in-house framework that automatically generates the schema model from our schema (flat file/DB), and we validate the data using pandera's built-in functions. As part of this, we validate 100+ files daily, with 200-300 million records and 200+ columns, and the results are automatically fed to a GUI and used to generate reports.

It would be better if this use case could be handled without the dependency on column coercion: I'm expecting the failure cases to include both rows (index 1 and index 2) for my data.

Thanks!!

cosmicBboy commented 2 years ago

cool, so if I understand correctly, this is a summary of your situation:

  1. you have a flat file/DB with raw data represented as strings
  2. you want to validate numeric properties of certain columns, but first you need to convert the strings to ints
  3. you want to know (a) which values cannot be converted to ints (b) which values fulfill certain numeric properties (e.g. gt=0)
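The steps above can be sketched in plain pandas (not pandera output; just the sample data from the snippet):

```python
import pandas as pd

# Raw column as read from a flat file: strings of mixed validity
raw = pd.Series(["1", "-1", "abc"], name="orderid")

# Step 2: attempt the string -> numeric conversion
converted = pd.to_numeric(raw, errors="coerce")  # "abc" becomes NaN

# Step 3a: values that cannot be converted to ints
uncoercible_index = raw.index[converted.isna()].tolist()  # [2]

# Step 3b: convertible values that violate the range constraint (gt=0)
out_of_range_index = raw.index[converted <= 0].tolist()   # [1]
```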

my next question is: do you care that the resulting valid data are of the expected dtypes?

learnnk commented 2 years ago

Here is my response:

> 1. you have a flat file/DB with raw data represented as strings

[learnnk] My bad: the file will not have entirely string data, it's a mixture of string and int data. My ask: for an integer column, we want to make sure it has only integer data. When we validate the data against this schema model, if the integer column has an alphanumeric value and another value that falls outside the given range, it throws an error only for the alphanumeric value, not for the range constraint I specified.

The output I'm getting:

```
  schema_context   column ... failure_case  index
0         Column  orderid ...          abc      2
```

Thanks!

cosmicBboy commented 2 years ago

> for integer column we want to make sure it has only integer data. When we use this schema model and validated against the data If integer column has alphanumeric value and another value which does not fall for the given range. It throws Error only for alphanumeric not for the range which I have given

Gotcha, so as I mentioned, pandera can't do both of these things at the same time for reasons stated here https://github.com/pandera-dev/pandera/issues/821#issuecomment-1087080411

Recommendation 1: Custom Check with no Dtype, or Any dtype.

You can forego specifying the integer datatype and create custom checks that (a) find uncoercible values and (b) apply the range-constraint validation. This means you can't coerce non-integer values to integers, but you'd be able to catch all the cases in one go, for example:

```python
from typing import Any

import pandera as pa
from pandera.typing import Series

class Retaildata(pa.SchemaModel):
    orderid: Series[Any] = pa.Field(nullable=False)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

    @pa.check("orderid")
    def check_is_int(cls, series):
        """Check that values are integers."""
        return series.map(lambda x: isinstance(x, int))

    @pa.check("orderid")
    def check_positive(cls, series):
        """Check that integer values are positive, ignoring non-integers."""
        mask = series.map(lambda x: not isinstance(x, int))
        return (series.mask(mask) > 0) | mask
```

This'll give you the error report you want:

```
  schema_context   column           check  check_number failure_case  index
0         Column  orderid    check_is_int             0          abc      2
1         Column  orderid  check_positive             1           -1      1
```

However, it will not coerce the datatype of orderid.
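The masking trick in check_positive can be sanity-checked in plain pandas (a sketch with a mixed-type column, since the checks above assume actual ints mixed with values of other types):

```python
import pandas as pd

# A mixed-type (object dtype) column, as the custom checks assume
series = pd.Series([1, -1, "abc"])

# check_is_int: True only for real Python ints
is_int = series.map(lambda x: isinstance(x, int))

# check_positive: mask out non-ints (-> NaN), test > 0 (NaN compares False),
# then OR with the mask so non-ints pass this check
mask = series.map(lambda x: not isinstance(x, int))
positive_or_non_int = (series.mask(mask) > 0) | mask

print(is_int.tolist())               # [True, True, False]
print(positive_or_non_int.tolist())  # [True, False, True]
```

Only the negative int fails check_positive, and only the string fails check_is_int, which is what produces the two-row error report above.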

Recommendation 2: Custom Dtype

If you care about the coercion to integers, you can also implement a custom data type whose coerce method handles values of the wrong type before the range constraints are applied; see here for more details on implementing custom dtypes.
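As a rough sketch of the coercion step such a custom dtype could perform (plain pandas, hypothetical helper name, not pandera's API): converting to a nullable integer dtype keeps uncoercible values as missing instead of raising, so later checks can still run on the rest of the column.

```python
import pandas as pd

def coerce_int(series: pd.Series) -> pd.Series:
    """Hypothetical coercion step: uncoercible values become <NA>
    (pandas nullable Int64) instead of raising an error."""
    return pd.to_numeric(series, errors="coerce").astype("Int64")

coerced = coerce_int(pd.Series(["1", "-1", "abc"]))
print(coerced.isna().tolist())  # [False, False, True]
```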

I also offer consulting services for these kinds of custom use-cases, I'd be happy to work with you and your company to help you implement these solutions! Please reach out at niels.bantilan@gmail.com if you're interested.

cosmicBboy commented 2 years ago

converting this to a discussion, @learnnk feel free to continue the discussion there!