Open Smartitect opened 1 year ago
Hey there! I'm facing the same problem while using pandera with pyspark. Is there an optimized way to check whether a column contains unique values? Did you manage to solve this problem?
Hi @fernandocfbf ! Check this answer, it may solve your problem.
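Independently of pandera, uniqueness can also be checked with plain PySpark. Below is a minimal sketch, assuming `df` is a `pyspark.sql.DataFrame`. The helper name `has_unique_values` and the `demo` function are made up for illustration and are not part of pandera or PySpark.

```python
def has_unique_values(df, column):
    """Return True if no value in `column` occurs more than once.

    `df` is assumed to be a pyspark.sql.DataFrame. groupBy puts all nulls
    into a single group, so two or more nulls count as duplicates here.
    """
    duplicates = (
        df.groupBy(column)        # one group per distinct value
        .count()                  # adds a "count" column with the group size
        .where("`count` > 1")     # keep only values that are repeated
        .limit(1)                 # a single hit is enough to answer the question
    )
    return duplicates.count() == 0


def demo():
    # Hypothetical usage; needs a working local Spark installation,
    # which is why pyspark is imported lazily inside this function.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "name"])
    print(has_unique_values(df, "name"))
    print(has_unique_values(df, "id"))
    spark.stop()
```

This still needs a shuffle for the `groupBy`, so it is not free on large data, but it avoids collecting anything to the driver beyond a single count.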
Description of issue
When using the `pandera.pyspark` module, creation of a new `DataFrameSchema` instance throws a `TypeError` when the `unique` parameter is included in `Column` class initialisation.
Code Sample
This generates the following error:
Expected behaviour
When using the `pandera.pyspark` module, it should be possible to include the `unique` parameter in `Column` declarations so that the uniqueness of values in a PySpark SQL dataframe column can be validated.
Environment
Additional context
Really excited about the ability to use Pandera to validate big data on Spark. I'm currently working on a blog post describing how to leverage this package in Azure Synapse and Microsoft Fabric.