Pyspark module - Column class does not support "unique" parameter.

Smartitect commented 1 year ago

Description of issue

When using the pandera.pyspark module, creating a new DataFrameSchema instance throws a TypeError if the unique parameter is passed to the Column class initialiser.

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.

Code Sample

from pandera.pyspark import DataFrameSchema, Column, Check
from pyspark.sql.types import StringType, IntegerType, DateType, FloatType

dataframe_schema_unique_parameter = DataFrameSchema(
    columns={
        "country": Column(
            dtype=StringType,
            unique=True,
        ),
    }
)

This generates the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [9], line 3
      1 dataframe_schema_unique_parameter = DataFrameSchema(
      2     columns={
----> 3         "country": Column(
      4             dtype=StringType,
      5             unique=True,
      6         ),
      7     }
      8 )

TypeError: Column.__init__() got an unexpected keyword argument 'unique'

Expected behaviour

When using the pandera.pyspark module, it should be possible to include the unique parameter in Column declarations so that the uniqueness of values in a PySpark SQL DataFrame column can be validated.

Environment

Azure Synapse Notebook
Browser: Edge
Python 3.10
Apache Spark 3.3
Pandera 0.16.1

Additional context

Really excited about the ability to use Pandera to validate big data on Spark. Currently working on a blog post describing how to leverage this package in Azure Synapse and Microsoft Fabric.

fernandocfbf commented 4 weeks ago

Hey there! I'm facing the same problem while using pandera with PySpark. Is there an optimized way to check whether a column contains only unique values? Did you manage to solve this problem?

filipeo2-mck commented 4 days ago

Hi @fernandocfbf! Check this answer, it may solve your problem.
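
Independently of the linked answer, a common plain-PySpark way to test a column for duplicates is to group by the column and keep only the groups with more than one row. The sketch below assumes a regular pyspark.sql.DataFrame; find_duplicate_values and is_unique are hypothetical helper names used for illustration, not part of pandera.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hints below
    from pyspark.sql import DataFrame


def find_duplicate_values(df: "DataFrame", column: str, limit: int = 10) -> list:
    """Return up to `limit` values that occur more than once in `column`.

    Hypothetical helper, not a pandera API.
    """
    duplicates = (
        df.groupBy(column)      # one group per distinct value
        .count()                # adds a column named "count"
        .filter("`count` > 1")  # keep only values seen more than once
        .limit(limit)           # cap the rows pulled back to the driver
    )
    return [row[column] for row in duplicates.collect()]


def is_unique(df: "DataFrame", column: str) -> bool:
    """True if no value in `column` appears more than once."""
    return not find_duplicate_values(df, column, limit=1)
```

With a DataFrame df that has a country column, is_unique(df, "country") tells you whether any value repeats, and find_duplicate_values(df, "country") returns a small sample of the offending values. Note that groupBy places all nulls in a single group, so more than one null counts as a duplicate in this sketch.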
