unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Raise SchemaInitError when unique is specified in a pyspark Field, and add a test showing the issue as well as the Config under which it works. #1592

Closed zippeurfou closed 3 weeks ago

zippeurfou commented 1 month ago

Following #1344, I am adding a bit of edited documentation. I wasn't able to raise SchemaInitError as @cosmicBboy suggested: it turns out that if you use pa.Field(unique=True), the value is seen as None, which is most likely the issue. The screenshot below shows the behavior when I added the code where @cosmicBboy suggested.

[Screenshot 2024-04-19 at 5:39:04 PM]

As the ghost text shows, it is None when I put a breakpoint there, so I did not add it.
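
For reference, a minimal repro of the situation described above (a sketch based on this thread, not code from the PR; the annotation style follows the pandera pyspark docs):

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

class Model(pa.DataFrameModel):
    # unique=True is accepted here without complaint, but by the time the
    # field is processed internally it reportedly shows up as None, so no
    # SchemaInitError was being raised before this PR.
    col: T.IntegerType() = pa.Field(unique=True)

# Translating the model did not raise either, even though the pyspark
# backend does not support column-level uniqueness.
schema = Model.to_schema()
```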

cosmicBboy commented 4 weeks ago

> if you use pa.Field(unique=True) it is seen as None which most likely is the issue.

How is this the case? Specifying pa.Field(unique=True) should translate to unique=True when the DataFrameModel is translated into a DataFrameSchema.
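
For comparison, this is the translation behavior in the pandas backend (a minimal sketch to illustrate the expected behavior; not part of this PR):

```python
import pandera as pa

class PandasModel(pa.DataFrameModel):
    col: int = pa.Field(unique=True)

# In the pandas backend the field-level setting survives translation:
schema = PandasModel.to_schema()
assert schema.columns["col"].unique is True
```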

(Also see here for how to sign your commits for the DCO check.)

cosmicBboy commented 4 weeks ago

Okay, so there are two issues here:

1. Specifying `unique` at the dataframe level:

```python
import pandera.pyspark as pa

class Model(pa.DataFrameModel):
    class Config:
        unique = ["col1", "col2"]  # col1 and col2 should be jointly unique
```

2. Specifying `pa.Field(unique=True)` at the column level:

```python
class Model(pa.DataFrameModel):
    col: int = pa.Field(unique=True)  # values in col need to be unique
```

For 1, the SchemaInitError here still makes sense: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/container.py#L148

For 2, a SchemaInitError needs to be added here: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/model_components.py#L184. This is because the underlying Column definition doesn't even support the unique argument: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/components.py#L15-L27
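
A rough sketch of what that guard could look like (the helper name and exact placement are hypothetical, not the actual diff; pandera.errors.SchemaInitError is the real exception class):

```python
from pandera.errors import SchemaInitError

def _reject_column_level_unique(unique) -> None:
    # Hypothetical helper: the pyspark Column component has no `unique`
    # argument, so fail fast at model-definition time instead of silently
    # dropping the setting.
    if unique:
        raise SchemaInitError(
            "unique=True is not supported for pyspark Fields; "
            "specify `unique` in the DataFrameModel Config instead."
        )
```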

zippeurfou commented 4 weeks ago

Thanks @cosmicBboy, let me rephrase things a bit. For 1, with the Config it does work as you expressed; I added a unit test with only one column, but I can add a second one with two columns. For 2, I can add it where you mentioned. I haven't had time to look much at how the internals work, so I appreciate the direction. My guess is that since 1 was implemented and works as expected, 2 should not be impossible to implement, but right now I sadly don't have the bandwidth, as I would need extra time to understand the internals of the library. I will also try to sort out the DCO when I have time.
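
A two-column version of that test might look roughly like this (a sketch, not the test from the PR; it assumes the pyspark DataFrameSchema exposes the Config-level `unique` setting after to_schema(), mirroring the pandas API):

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

def test_dataframe_level_unique_two_columns():
    class Model(pa.DataFrameModel):
        col1: T.IntegerType() = pa.Field()
        col2: T.IntegerType() = pa.Field()

        class Config:
            unique = ["col1", "col2"]  # jointly unique

    # The Config-level setting should survive translation to a schema.
    schema = Model.to_schema()
    assert schema.unique == ["col1", "col2"]
```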

zippeurfou commented 3 weeks ago

@cosmicBboy updated the PR according to my understanding.

zippeurfou commented 3 weeks ago

I am not sure why the linter didn't execute here.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.16%. Comparing base (4df61da) to head (4fffae2). Report is 75 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1592      +/-   ##
===========================================
- Coverage   94.29%   83.16%   -11.14%
===========================================
  Files          91      114       +23
  Lines        7024     8504     +1480
===========================================
+ Hits         6623     7072      +449
- Misses        401     1432     +1031
```

:umbrella: View full report in Codecov by Sentry.

cosmicBboy commented 3 weeks ago

Thanks @zippeurfou and congrats on your first contribution to pandera! 🚀