Describe the bug
Currently, when you register a custom check in pyspark there is no option to add a custom error message, as you can when registering a builtin check. As a result, the error message on a failed check is None.
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandera.
[X] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import json

import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.api.extensions import register_check_method
from pandera.api.pyspark.types import PysparkDataframeColumnObject
from pandera.backends.pyspark.decorators import register_input_datatypes
from pandera.backends.pyspark.utils import convert_to_list
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("A", 1), ("B", -1)]
schema = T.StructType(
    [
        T.StructField("id", T.StringType(), True),
        T.StructField("quantity", T.IntegerType(), True),
    ]
)
orders = spark.createDataFrame(data, schema=schema)


@register_check_method()  # error="fraction_ge({value=}, {fraction=})"
@register_input_datatypes(acceptable_datatypes=convert_to_list(T.IntegerType))
def fraction_ge(data: PysparkDataframeColumnObject, value: int, fraction: float) -> bool:
    """Ensure that at least a specified fraction of integer values in a column are greater than or equal to a threshold."""
    if not 0 <= fraction <= 1:
        raise ValueError("Fraction must be between 0 and 1")
    total_count = data.dataframe.count()
    if total_count == 0:
        return False
    cond = F.col(data.column_name) >= value
    valid_count = data.dataframe.filter(cond).count()
    return (valid_count / total_count) >= fraction


class OrdersSchema(pa.DataFrameModel):
    id: T.StringType
    quantity: T.IntegerType = pa.Field(fraction_ge={"value": 0, "fraction": 0.9})


orders = OrdersSchema.validate(orders)
print(json.dumps(orders.pandera.errors, indent=4))
Expected behavior
Ideally, the @register_check_method decorator should have an optional error parameter, like @register_builtin_check has. With the example above, the registration would look like:
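A sketch of the proposed usage, assuming error accepts a format string the way it does for builtin checks (this simply uncomments the parameter shown in the code sample above; the function body stays the same):

# Proposed: pass the custom error message at registration time.
@register_check_method(error="fraction_ge({value=}, {fraction=})")
@register_input_datatypes(acceptable_datatypes=convert_to_list(T.IntegerType))
def fraction_ge(data: PysparkDataframeColumnObject, value: int, fraction: float) -> bool:
    """Ensure that at least a specified fraction of integer values in a column are greater than or equal to a threshold."""
    ...  # body unchanged from the code sample above

The output on a failed check would then contain this formatted message in the error field of the validation report instead of None.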