unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Pyspark module - Column class does not support "unique" parameter. #1313

Open Smartitect opened 1 year ago

Smartitect commented 1 year ago

Description of issue

When using the pandera.pyspark module, creating a new DataFrameSchema instance throws a TypeError if the unique parameter is included in the Column class initialisation.

Code Sample

from pandera.pyspark import DataFrameSchema, Column, Check
from pyspark.sql.types import StringType, IntegerType, DateType, FloatType

dataframe_schema_unique_parameter = DataFrameSchema(
    columns={
        "country": Column(
            dtype=StringType,
            unique=True,
        ),
    }
)

This generates the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [9], line 3
      1 dataframe_schema_unique_parameter = DataFrameSchema(
      2     columns={
----> 3         "country": Column(
      4             dtype=StringType,
      5             unique=True,
      6         ),
      7     }
      8 )

TypeError: Column.__init__() got an unexpected keyword argument 'unique'

Expected behaviour

When using the pandera.pyspark module, it should be possible to include the unique parameter in Column declarations so that the uniqueness of values in a Pyspark SQL dataframe column can be validated.

Environment

Additional context

Really excited about the ability to use Pandera to validate big data on Spark. Currently working on a blog post describing how to leverage this package in Azure Synapse and Microsoft Fabric.

fernandocfbf commented 3 days ago

Hey there! I'm facing the same problem while using pandera with PySpark. Is there an efficient way to check whether a column contains only unique values? Did you manage to solve this problem?