Closed: learnnk closed 2 years ago
Hi @learnnk, can you please provide a minimally reproducible code snippet with your (i) schema and (ii) data... basically what you've done above except in an easily copy-pasteable form (and the unexpected error/output you're seeing). This will help us help you 🤝
Thanks for your response! Here is the code snippet:
```python
import pandera as pa
import pandas as pd
from pandera.typing import Index, DataFrame, Series
from pandera import Check, Column, DataFrameSchema

class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True, nullable=False, gt=0)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

df = pd.DataFrame({
    "orderid": ["1", "-1", "abc"],
    "orderdate": ["2013-07-25", "2013-07-27", "2013-07-28"],
    "orderamount": ["1234", "156", "365"],
    "orderstatus": ["PENDING_PAYMENT", "COMPLETE", "CLOSED"],
})

try:
    Retaildata.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases, "\n")
```
The output I'm getting is:

```
  schema_context   column  ... failure_case  index
0         Column  orderid  ...          abc      2

[1 rows x 6 columns]
```
As I said, I'm not getting a failure case for the negative value in orderid at index 1.
FYI you can do triple backticks ``` to preserve indentation:
```python
import pandera as pa
import pandas as pd
from pandera.typing import Index, DataFrame, Series
from pandera import Check, Column, DataFrameSchema

class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True, nullable=False, gt=0)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)
```
this makes it easier to copy-paste into (in this case) my editor to help debug
Ah, you're seeing just the coercion error (failure to coerce "abc" to an int):
```
  schema_context   column                  check check_number failure_case  index
0         Column  orderid  coerce_dtype('int64')         None          abc      2
```
because pandera doesn't apply the gt=0 constraint check, for the following reason: currently, failure to coerce a column to the specified dtype short-circuits the rest of the property checks. The original reasoning is this: if a column cannot even be coerced to the correct type, then it's fair to assume that we cannot apply the check constraint (which assumes the column is of a certain type) in the first place.
This might be too strict an assumption, but I'd be open to changing this behavior if you and others find this unintuitive.
In any case, it would be worth adding an explanation of this behavior in the lazy validation docs... https://pandera.readthedocs.io/en/stable/lazy_validation.html
I'm curious: what is your specific use case, and what are the ways you're using pandera/pandas to clean/validate data? E.g. is this a programmatic data pipeline, or are you using pandera in the context of a GUI where a human updates the state of the data with multiple rounds of validation?
Thanks for the quick update. Here is my response to your question:
> I'm curious: what is your specific use case, and what are the ways you're using pandera/pandas to clean/validate data? E.g. is this a programmatic data pipeline, or are you using pandera in the context of a GUI where a human updates the state of the data with multiple rounds of validation?
[learnnk] We are using pandera for data-quality validation. We built an in-house framework that automatically generates the schema model from our schema (flat file/DB), and we validate the data using pandera's built-in functions. As part of this we validate 100+ files daily, with 200-300 million records and 200+ columns; the validated data is then automatically fed to a GUI and used to generate reports.
It would be better if this use case could be satisfied without the dependency on column coercion. For my data I'm expecting the failure cases to include both rows (index 1 and index 2).
Thanks!!
Cool, so if I understand correctly, this is a summary of your situation:

1. you have a flat file/DB with raw data represented as strings
2. you want the failure report to include both the uncoercible values and the values violating the gt=0 constraint

My next question is: do you care that the resulting valid data are of the expected dtypes?
Here is my response:

> 1. you have a flat file/DB with raw data represented as strings

[learnnk] My bad, my file will not contain entirely string data; it's a mixture of string and int data. My ask: for an integer column we want to make sure it has only integer data. When we use this schema model and validate it against the data, if the integer column has an alphanumeric value and another value that falls outside the given range, it throws an error only for the alphanumeric value, not for the range I specified.

The output I'm getting is:

```
  schema_context   column  ... failure_case  index
0         Column  orderid  ...          abc      2
```
Thanks!
> for integer column we want to make sure it has only integer data. When we use this schema model and validated against the data If integer column has alphanumeric value and another value which does not fall for the given range. It throws Error only for alphanumeric not for the range which I have given
Gotcha. So, as I mentioned, pandera can't do both of these things at the same time, for the reasons stated here: https://github.com/pandera-dev/pandera/issues/821#issuecomment-1087080411

One workaround is to annotate the column with the `Any` dtype: you can forego specifying the integer datatype and instead create a custom check that (a) finds uncoercible values and (b) applies the range-constraint validation. This would mean that you can't coerce non-integer values to integers, but you'd be able to catch all the cases in one go, for example:
```python
from typing import Any

import pandera as pa
from pandera.typing import Series

class Retaildata(pa.SchemaModel):
    orderid: Series[Any] = pa.Field(nullable=False)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

    @pa.check("orderid")
    def check_is_int(cls, series):
        """Check that values are integers"""
        return series.map(lambda x: isinstance(x, int))

    @pa.check("orderid")
    def check_positive(cls, series):
        """Check that integer values are positive, ignoring non-integers"""
        mask = series.map(lambda x: not isinstance(x, int))
        return (series.mask(mask) > 0) | mask
```
This'll give you the error report you want:
```
  schema_context   column           check check_number failure_case  index
0         Column  orderid    check_is_int            0          abc      2
1         Column  orderid  check_positive            1           -1      1
```
However, it will not coerce the datatype of orderid.

If you care about coercion to integers, you can also implement a custom data type with a coerce method that handles values of the wrong type before applying the range constraints; see here for more details on implementing custom dtypes.
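As a lighter-weight alternative to a full custom dtype (this is my own illustrative sketch using plain pandas, not the pandera custom-dtype API referred to above), the uncoercible values can be normalized before validation with `pd.to_numeric`, so that a range check can then run on a clean nullable-integer column:

```python
import pandas as pd

def coerce_to_nullable_int(series: pd.Series) -> pd.Series:
    """Coerce values to pandas' nullable Int64; uncoercible values become <NA>."""
    return pd.to_numeric(series, errors="coerce").astype("Int64")

orderid = coerce_to_nullable_int(pd.Series(["1", "-1", "abc"]))
# orderid is now [1, -1, <NA>]: "abc" surfaces as <NA> (the type failure),
# and the range constraint can be evaluated on the remaining values.
bad_type = orderid.isna()                    # flags index 2 ("abc")
out_of_range = (orderid <= 0).fillna(False)  # flags index 1 (-1)
```

Both failure masks are available in one pass, at the cost of losing pandera's built-in coercion error report for the column.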
I also offer consulting services for these kinds of custom use-cases, I'd be happy to work with you and your company to help you implement these solutions! Please reach out at niels.bantilan@gmail.com if you're interested.
Converting this to a discussion. @learnnk, feel free to continue the discussion there!
Hi Team, I'm using the pandera schema model to validate my file, which has an integer column.
My schema model:

```python
class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True, nullable=False)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)
```
Data:

```
  orderid  orderdate              orderamount  orderstatus
0 1        2013-07-25 00:00:00.0  11599        PENDING_PAYMENT
1 -1       2013-07-25 00:00:00.0  256          PENDING_PAYMENT
2 abc      2013-07-25 00:00:00.0  12111        COMPLETE
```

Output:

```
  schema_context   column  ... failure_case  index
0         Column  orderid  ...          abc      2

[1 rows x 6 columns]
```
To test this column I included (1) a string value and (2) a negative value in the data above.

Output: it always shows the failure case for the string value at index 2 (point 1), but not the index for the negative value (point 2).

Can you please suggest/guide how to capture the index of the failure case for the negative row (index 1 in the data)?