Description of issue
When using the pandera.pyspark module, creating a new DataFrameSchema instance throws a TypeError when the unique parameter is included in the Column class initialisation.
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.
Expected behaviour
When using the pandera.pyspark module, it should be possible to include the unique parameter in Column declarations so that the uniqueness of values in a PySpark SQL DataFrame column can be validated.
Environment
Azure Synapse Notebook
Browser: Edge
Python 3.10
Apache Spark 3.3
Pandera 0.16.1
Additional context
Really excited about the ability to use Pandera to validate big data on Spark. Currently working on a blog describing how to leverage this package in Azure Synapse and Microsoft Fabric.
Hey there!
I'm facing the same problem while using pandera with PySpark. Is there an efficient way to check whether a column contains unique values? Did you manage to solve this problem?
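Until the unique parameter is accepted by pandera.pyspark, one possible workaround is to check uniqueness with plain PySpark alongside schema validation. The following is only a sketch and not part of pandera: the helper names `duplicated_values` and `is_unique` are hypothetical, and it assumes `df` is a `pyspark.sql.DataFrame`, using only its standard `groupBy`, `count`, `filter` and `take` methods.

```python
def duplicated_values(df, column):
    """Return a DataFrame listing the values that occur more than once in `column`.

    Hypothetical helper (not a pandera API). `df` is assumed to be a
    pyspark.sql.DataFrame.
    """
    # groupBy(column).count() yields one row per distinct value plus a
    # "count" column holding how many rows carry that value.
    # Note: nulls form their own group, so repeated nulls count as duplicates.
    return df.groupBy(column).count().filter("`count` > 1")


def is_unique(df, column):
    """Return True when no value in `column` appears more than once."""
    # take(1) brings at most one offending row back to the driver.
    return len(duplicated_values(df, column).take(1)) == 0


# Usage, assuming an active SparkSession named `spark` (as in a Synapse notebook):
#   df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "label"])
#   is_unique(df, "id")                  # "id" has a repeated value
#   duplicated_values(df, "id").show()   # inspect the offending values
```

The aggregation runs on the cluster and only the duplicate check result reaches the driver, which may matter for large tables. It does not integrate with pandera's error reporting, so any failure would have to be raised or logged separately.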