karutyunov opened 1 year ago
I've coded this quickly to solve this with a custom registration of a builtin check:

```python
from typing import Optional

import pyspark.sql.types as T
from pyspark.sql import functions as F

from pandera.api.extensions import register_builtin_check
from pandera.api.pyspark.types import PysparkDataframeColumnObject
from pandera.backends.pyspark.decorators import register_input_datatypes
from pandera.backends.pyspark.utils import convert_to_list


@register_builtin_check(error="str_length({min_value}, {max_value})")
@register_input_datatypes(acceptable_datatypes=convert_to_list(T.StringType))
def str_length(
    data: PysparkDataframeColumnObject,
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> bool:
    """Ensure that the length of strings in a column is within a specified range."""
    if min_value is None and max_value is None:
        raise ValueError("Must provide at least one of 'min_value' and 'max_value'")
    str_len = F.length(F.col(data.column_name))
    cond = F.lit(True)
    if min_value is not None:
        cond = cond & (str_len >= min_value)
    if max_value is not None:
        cond = cond & (str_len <= max_value)
    return data.dataframe.filter(~cond).limit(1).count() == 0
```
This should be added to the `pyspark.sql` builtin checks.
Describe the bug
When trying to use the str_length function in pa.Field to validate the length of a string, we get a NotImplementedError every time. I tried passing the arguments in different ways, both as keyword arguments (as in the screenshot) and positionally as str_length(1, 2); both options give the same error.
Code Sample, a copy-pastable example
Expected behavior
We expect str_length to either pass or fail validation (i.e., report a validation error); instead, it raises NotImplementedError.
Additional context
As part of testing, I also tried the in_range function, since it uses the same argument-passing syntax; it works flawlessly.