unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
Add custom error message for pyspark `register_check_method` which currently defaults to `None` #1716

Open marrov opened 2 months ago

marrov commented 2 months ago

Describe the bug

Currently, when you register a custom check for pyspark with `register_check_method`, there is no option to set a custom error message, as there is with `register_builtin_check`. As a result, the error message reported when the check fails is `None`.

Code Sample, a copy-pastable example

import json
import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.api.extensions import register_check_method
from pandera.api.pyspark.types import PysparkDataframeColumnObject
from pandera.backends.pyspark.decorators import register_input_datatypes
from pandera.backends.pyspark.utils import convert_to_list
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
data = [("A", 1), ("B", -1)]
schema = T.StructType([T.StructField("id", T.StringType(), True), T.StructField("quantity", T.IntegerType(), True)])
orders = spark.createDataFrame(data, schema=schema)

@register_check_method()  # error="fraction_ge({value=}, {fraction=})"
@register_input_datatypes(acceptable_datatypes=convert_to_list(T.IntegerType))
def fraction_ge(data: PysparkDataframeColumnObject, value: int, fraction: float) -> bool:
    """Ensure that at least a specified fraction of integer values in a column are greater than or equal to a threshold."""
    if not 0 <= fraction <= 1:
        raise ValueError("Fraction must be between 0 and 1")
    total_count = data.dataframe.count()
    if total_count == 0:
        return False
    cond = F.col(data.column_name) >= value
    valid_count = data.dataframe.filter(cond).count()

    return (valid_count / total_count) >= fraction

class OrdersSchema(pa.DataFrameModel):
    id: T.StringType
    quantity: T.IntegerType = pa.Field(fraction_ge={"value": 0, "fraction": 0.9})

orders = OrdersSchema.validate(orders)
print(json.dumps(orders.pandera.errors, indent=4))

Result:

{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "OrdersSchema",
                "column": "quantity",
                "check": "fraction_ge",
                "error": "column 'quantity' with type IntegerType() failed validation None"
            }
        ]
    }
}

Expected behavior

Ideally, the @register_check_method decorator should accept an optional error parameter, as @register_builtin_check does. With the example above, the decorator would look like:

@register_check_method(error="fraction_ge({value=}, {fraction=})")

The output on a failed check would be:

{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "OrdersSchema",
                "column": "quantity",
                "check": "fraction_ge",
                "error": "column 'quantity' with type IntegerType() failed validation fraction_ge(value=0, fraction=0.9)"
            }
        ]
    }
}
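
To make the requested behavior concrete, here is a minimal pure-Python sketch of how the failure message could be assembled from the check name and its keyword arguments. The helpers `render_check_error` and `failure_message` are hypothetical names used only for illustration; they are not part of pandera, and the real interpolation of a template such as `fraction_ge({value=}, {fraction=})` may well work differently.

```python
from typing import Any, Mapping, Optional


def render_check_error(check_name: str, kwargs: Mapping[str, Any]) -> str:
    """Hypothetical helper: render 'name(key=value, ...)' from the check's kwargs."""
    args = ", ".join(f"{key}={value!r}" for key, value in kwargs.items())
    return f"{check_name}({args})"


def failure_message(column: str, dtype: str, check_error: Optional[str]) -> str:
    """Hypothetical helper: mimic the message format shown in the output above."""
    return f"column '{column}' with type {dtype} failed validation {check_error}"


# Current behavior: no error string is registered, so None ends up in the message.
current = failure_message("quantity", "IntegerType()", None)

# Proposed behavior: the error string is built from the check's arguments.
proposed = failure_message(
    "quantity",
    "IntegerType()",
    render_check_error("fraction_ge", {"value": 0, "fraction": 0.9}),
)

print(current)
print(proposed)
```

The first line printed matches the `error` field in the current result, and the second matches the expected one, assuming the check's keyword arguments are available when the message is built.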

cosmicBboy commented 2 months ago

@marrov please feel free to make a PR for this!