unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.12k stars, 294 forks

Missing `reason_code` when using custom checks with PySpark dataframes #1645

Open melvinkokxw opened 1 month ago

melvinkokxw commented 1 month ago

Describe the bug

Using a custom check with a PySpark dataframe raises the exception `AttributeError: 'NoneType' object has no attribute 'name'`.

The cause is that reason_code is not provided when raising SchemaError after a failed custom check, specifically here: https://github.com/unionai-oss/pandera/blob/d2bfed03e107358d60266108478711cdbe704e9c/pandera/backends/pyspark/base.py#L99-L107

And when collecting errors here: https://github.com/unionai-oss/pandera/blob/d2bfed03e107358d60266108478711cdbe704e9c/pandera/api/base/error_handler.py#L127

Trying to access .name on the non-existent reason_code (i.e. None) causes an AttributeError.
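To illustrate the failure mode in isolation, here is a minimal sketch that does not use pandera at all. `ReasonCode`, `FakeSchemaError`, `collect`, and `try_collect` are hypothetical stand-ins invented for this example; they only mimic the pattern described above (an error created without a reason code, then a collector that reads `.name` on it unconditionally) and are not the library's actual code.

```python
from enum import Enum
from typing import Optional


class ReasonCode(Enum):
    """Hypothetical stand-in for a reason-code enum."""

    DATAFRAME_CHECK = "dataframe_check"


class FakeSchemaError(Exception):
    """Hypothetical stand-in for an error whose reason code is optional."""

    def __init__(self, message: str, reason_code: Optional[ReasonCode] = None):
        super().__init__(message)
        self.reason_code = reason_code


def collect(error: FakeSchemaError) -> dict:
    # Reads .name unconditionally, so a missing reason code (None) fails here.
    return {"reason_code": error.reason_code.name, "error": str(error)}


def try_collect(error: FakeSchemaError):
    # Return the collected record, or the AttributeError message on failure.
    try:
        return collect(error)
    except AttributeError as exc:
        return str(exc)


# Error raised without a reason code: collection itself blows up.
without_code = try_collect(FakeSchemaError("custom_check failed"))
# Error raised with a reason code: collection succeeds.
with_code = collect(
    FakeSchemaError("custom_check failed", ReasonCode.DATAFRAME_CHECK)
)

print(without_code)
print(with_code)
```

In this sketch the first call surfaces the `'NoneType' object has no attribute 'name'` message, while the second returns a normal record. That suggests the fix belongs at the point where the error is raised (supplying a reason code), though a guard in the collector would also avoid the crash.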

Code Sample, a copy-pastable example

import pandera.pyspark as psa
import pyspark.sql as ps
from pandera.extensions import register_check_method
from pyspark.sql import types as T

# Register a dataframe-level custom check that always fails.
@register_check_method
def custom_check(pyspark_df: ps.DataFrame):
    return False

class Schema(psa.DataFrameModel):
    field1: T.IntegerType() = psa.Field()
    field2: T.IntegerType() = psa.Field()

    class Config:
        # Apply the registered check with no arguments.
        custom_check = ()

spark = ps.SparkSession.builder.appName("example").getOrCreate()

schema = T.StructType([
    T.StructField("field1", T.IntegerType(), True),
    T.StructField("field2", T.IntegerType(), True),
])

data = [(1, 2)]

df = spark.createDataFrame(data, schema)
# Raises AttributeError: 'NoneType' object has no attribute 'name'
Schema.validate(df)

Expected behavior

Validation should fail and raise a SchemaError (or SchemaErrors?), not an AttributeError.

MatthiasRoels commented 1 month ago

I just stumbled upon the same issue. Glad someone already reported it; hopefully it can be fixed soon.