unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

How to name the checks when using register_check_method #1332

Open gerileka opened 1 year ago

gerileka commented 1 year ago

Question about pandera

Hello, I am still new to Pandera and especially interested in the new PySpark implementation. I am writing some custom checks with register_check_method (a regex check, for example), but I am having trouble naming them. As you can see below, for both custom checks the check name is reported as None.

defaultdict(list,
            {'DATAFRAME_CHECK': [{'schema': 'PanderaSchema',
               'column': 'id',
               'check': 'greater_than(5)',
               'error': "column 'id' with type IntegerType() failed validation greater_than(5)"},
              {'schema': 'PanderaSchema',
               'column': 'product',
               'check': None,
               'error': "column 'product' with type StringType() failed validation None"},
              {'schema': 'PanderaSchema',
               'column': 'price',
               'check': None,
               'error': "column 'price' with type FloatType() failed validation None"}]})

Below is the code that produces this result. Maybe I am missing something in the documentation.

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql.functions import col
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pandera.pyspark import DataFrameModel

from pandera.extensions import register_check_method

spark = SparkSession.builder.getOrCreate()

data = [
    (5, "Bread1", 44.4, ["description of product"], {"product_category": "dairy"}),
    (15, "Butter", 99.0, ["more details here"], {"product_category": "bakery"}),
]

spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("product", T.StringType(), False),
        T.StructField("price", T.FloatType(), False),
        T.StructField("description", T.ArrayType(T.StringType(), False), False),
        T.StructField(
            "meta", T.MapType(T.StringType(), T.StringType(), False), False
        ),
    ],
)
df = spark.createDataFrame(data, spark_schema)

@register_check_method
def new_pyspark_check(pyspark_obj, *, max_value) -> bool:
    cond = col(pyspark_obj.column_name) <= max_value
    return pyspark_obj.dataframe.filter(~cond).limit(1).count() == 0

@register_check_method
def regex_check(pyspark_obj, *, regex_expression) -> bool:
    cond = col(pyspark_obj.column_name).rlike(regex_expression)
    return pyspark_obj.dataframe.filter(~cond).limit(1).count() == 0

class PanderaSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product: T.StringType() = pa.Field(
            regex_check={
                "regex_expression": "^[a-zA-Z]+$"
            }
        )
    price: T.FloatType() = pa.Field(
            new_pyspark_check={
                "max_value": 10.0
            }
        )
    description: T.ArrayType(T.StringType()) = pa.Field()
    meta: T.MapType(T.StringType(), T.StringType()) = pa.Field()

validation = PanderaSchema.validate(check_obj=df)
import json

dict_out_errors = dict(validation.pandera.errors)
print(dict_out_errors["DATA"])
defaultdict(list,
            {'DATAFRAME_CHECK': [{'schema': 'PanderaSchema',
               'column': 'id',
               'check': 'greater_than(5)',
               'error': "column 'id' with type IntegerType() failed validation greater_than(5)"},
              {'schema': 'PanderaSchema',
               'column': 'product',
               'check': None,
               'error': "column 'product' with type StringType() failed validation None"},
              {'schema': 'PanderaSchema',
               'column': 'price',
               'check': None,
               'error': "column 'price' with type FloatType() failed validation None"}]})

Am I missing something about the naming?

cosmicBboy commented 1 year ago

@NeerajMalhotra-QB @jaskaransinghsidana any idea why the check name for a custom registered check isn't showing up in the validation report?

NeerajMalhotra-QB commented 1 year ago

I could reproduce it, but I have only been able to glance at it so far. I believe that, for some reason, schema.check.error is coming through as null at link

def format_generic_error_message(
    parent_schema,
    check,
) -> str:
    """Construct an error message when a check validator fails.

    :param parent_schema: class of schema being validated.
    :param check: check that generated error.
    """
    return f"{parent_schema} failed validation " f"{check.error}"

which is invoked from: link

def run_check(
    self,
    check_obj,
    schema,
    check,
    check_index: int,
    *args,
) -> bool:

    check_result = check(check_obj, *args)
    if not check_result.check_passed:
        # encode scalar False values explicitly
        failure_cases = scalar_failure_case(check_result.check_passed)
        error_msg = format_generic_error_message(schema, check)
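
To make the symptom concrete, here is a minimal standalone sketch of how a missing error value ends up in the message. It reuses the f-string from the excerpt above, but StandInCheck is a hypothetical placeholder written for this illustration, not pandera's check class, and nothing here imports pandera.

```python
from typing import Optional


class StandInCheck:
    """Hypothetical stand-in for a check object; NOT pandera's class."""

    def __init__(self, error: Optional[str] = None):
        self.error = error


def format_generic_error_message(parent_schema, check) -> str:
    # Same f-string as in the excerpt above: check.error is interpolated as-is.
    return f"{parent_schema} failed validation " f"{check.error}"


builtin_like = StandInCheck(error="greater_than(5)")
custom_like = StandInCheck()  # error was never set, so it stays None

builtin_msg = format_generic_error_message(
    "column 'id' with type IntegerType()", builtin_like
)
custom_msg = format_generic_error_message(
    "column 'product' with type StringType()", custom_like
)

print(builtin_msg)
print(custom_msg)
```

If error is None, the f-string renders the literal text "None", which matches the "failed validation None" messages in the report.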

@jaskaransinghsidana, do you remember off the top of your head what could be causing this?

jaskaransinghsidana commented 1 year ago

@NeerajMalhotra-QB I don't remember off the top of my head; I need to take a look at the flow.

jaskaransinghsidana commented 1 year ago

I took a look at this. It comes from how the check is registered. When a built-in check is registered, it has an attribute called error, defined on the base class BaseCheck, which holds the error information:


class BaseCheck(metaclass=MetaCheck):
    """Check base class."""

    def __init__(
        self,
        name: Optional[str] = None,
        error: Optional[str] = None,
        statistics: Optional[Dict[str, Any]] = None,
    ):
        self.name = name
        self.error = error
        self.statistics = statistics

However, when a custom check is registered, this error attribute is None for some reason. @NeerajMalhotra-QB, since this comes from the API, would you be able to take a look at it? @cosmicBboy, I am not familiar with the register_check_method code and need some time; meanwhile, can you tell me what the expected difference is between registering a built-in check and a custom one?
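
As a rough illustration of the behaviour described above, the sketch below uses a simplified stand-in for BaseCheck (the MetaCheck metaclass is omitted) to show that error stays None unless it is passed in. The describe_check helper is purely hypothetical: it shows one possible fallback order and is not pandera code or a proposed fix. Whether name is actually populated for checks created through register_check_method is not established in this thread.

```python
from typing import Any, Dict, Optional


class BaseCheck:
    """Simplified stand-in mirroring the __init__ shown above (no metaclass)."""

    def __init__(
        self,
        name: Optional[str] = None,
        error: Optional[str] = None,
        statistics: Optional[Dict[str, Any]] = None,
    ):
        self.name = name
        self.error = error
        self.statistics = statistics


def describe_check(check: BaseCheck) -> str:
    """Hypothetical helper: prefer error, then name, then a placeholder."""
    return check.error or check.name or "<unnamed check>"


# A check constructed with an explicit error string.
with_error = BaseCheck(name="greater_than", error="greater_than(5)")
# A check where only the name was provided (assumption for illustration).
name_only = BaseCheck(name="regex_check")
# A check where neither was provided.
bare = BaseCheck()

print(describe_check(with_error))
print(describe_check(name_only))
print(describe_check(bare))
```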