unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Pyspark module - str_length check not implemented. #1314

Open Smartitect opened 1 year ago

Smartitect commented 1 year ago

Description of issue

When using the pandera.pyspark module, validating a DataFrame against a DataFrameSchema that uses Check.str_length() as a column-level check produces a NotImplementedError.

Code Sample

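from pandera.pyspark import DataFrameSchema, Column, Check
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType

# Create or reuse a Spark session (notebook environments usually provide `spark` already).
spark = SparkSession.builder.getOrCreate()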

dataframe_schema_string_length = DataFrameSchema(
    columns={
        "index": Column(
            dtype=IntegerType,
        ),
        "participant_id": Column(
            dtype=StringType,
            checks=[
                Check.str_length(32)
            ],
        ),
    },
    coerce=True,
    strict=False,
)

df_to_validate = spark.createDataFrame(
    [
        (1, "ee584ba55112f89ec9d5a7cabd52f705"),
        (2, "fcfd946100c0147583b63b6789dc0252"),
        (3, "091631f6c8bdf72e7c55d4d91b874c43"),
        (4, "433cb57085b9b4f3f268e655108b637d"),
        (5, "60d6aad80845ff205936d1ff4b290f00"),
    ],
    ["index", "participant_id"],
)

dataframe_schema_string_length.validate(df_to_validate).pandera.errors

This generates the following output:

defaultdict(<function pandera.api.pyspark.error_handler.ErrorHandler.__init__.<locals>.<lambda>()>,
            {'DATA': defaultdict(list,
                         {'CHECK_ERROR': [{'schema': None,
                            'column': 'participant_id',
                            'check': 'str_length(32, None)',
                            'error': 'Error while executing check function: NotImplementedError ...'}]})})
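
Note that the check is reported as str_length(32, None), which suggests the single argument is being read as a minimum length with no maximum. As a rough plain-Python sketch of the per-value semantics I would expect the check to apply once implemented (the helper name str_length_ok is hypothetical and not part of pandera, and the bounds are assumed to be inclusive):

```python
from typing import Optional


def str_length_ok(
    value: str,
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> bool:
    """Hypothetical helper: True if len(value) lies within the inclusive bounds."""
    n = len(value)
    if min_value is not None and n < min_value:
        return False
    if max_value is not None and n > max_value:
        return False
    return True


# Two of the participant_id values from the code sample above.
ids = [
    "ee584ba55112f89ec9d5a7cabd52f705",
    "fcfd946100c0147583b63b6789dc0252",
]

# Mirrors str_length(32, None): a lower bound only.
results = [str_length_ok(v, min_value=32) for v in ids]
```

Under that reading, a 33-character value would also pass with only a lower bound, so requiring exactly 32 characters would need both bounds set to 32.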

Expected behaviour

When using the pandera.pyspark module, Check.str_length() should be usable when validating a PySpark SQL DataFrame against a DataFrameSchema object.

Environment

Additional context

Really excited about the ability to use Pandera to validate big data on the Spark platform. I'm working on a blog post describing how to use this package in Azure Synapse and Microsoft Fabric.

Smartitect commented 1 year ago

Just spotted that this is also reported under issue #1311.