> if you use `pa.Field(unique=True)` it is seen as None which most likely is the issue.

How is this the case? Specifying `pa.Field(unique=True)` should translate to `unique is True` when the `DataFrameModel` is translated into a `DataFrameSchema`.
(Also, see here for how to sign your commits for the DCO check.)
Okay, so there are two issues here:

1. `unique` at the dataframe level:

        import pandera.pyspark as pa

        class Model(pa.DataFrameModel):
            class Config:
                unique = ["col1", "col2"]  # col1 and col2 should be jointly unique

2. `pa.Field(unique=True)` at the column level:

        class Model(pa.DataFrameModel):
            col: int = pa.Field(unique=True)  # values in col need to be unique
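
To make the difference between the two cases concrete, here is a small sketch in plain Python (no pandera or pyspark involved). The `is_unique` helper and the sample rows are hypothetical and only illustrate the semantics described in the comments above: joint uniqueness over several columns versus uniqueness of a single column.

```python
from collections import Counter

# Hypothetical sample data: three rows with two columns.
rows = [
    {"col1": 1, "col2": "a"},
    {"col1": 1, "col2": "b"},
    {"col1": 2, "col2": "a"},
]


def is_unique(rows, columns):
    """Return True if no combination of values in `columns` repeats."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(n == 1 for n in counts.values())


# Case 1: joint uniqueness. Every (col1, col2) pair is distinct.
jointly_unique = is_unique(rows, ["col1", "col2"])

# Case 2: column-level uniqueness. The value 1 appears twice in col1.
col1_unique = is_unique(rows, ["col1"])
```

The same data can satisfy the joint constraint while failing the per-column one, which is why the two cases are handled separately.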
For 1, the `SchemaInitError` here still makes sense: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/container.py#L148

For 2, a `SchemaInitError` needs to be added here: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/model_components.py#L184. This is because the underlying `Column` definition doesn't even support the `unique` argument: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/components.py#L15-L27
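
As a rough illustration of the kind of guard being suggested for case 2, here is a self-contained sketch. It does not use pandera's actual internals: `FieldInfo` and `build_column_kwargs` are hypothetical names, and `SchemaInitError` is redefined locally as a stand-in so the snippet runs on its own. The idea is only to reject `unique` early when the target column definition has no such argument.

```python
from dataclasses import dataclass
from typing import Optional


class SchemaInitError(Exception):
    """Local stand-in for pandera's SchemaInitError (assumption for this sketch)."""


@dataclass
class FieldInfo:
    """Hypothetical, simplified holder for per-column field options."""

    unique: Optional[bool] = None
    nullable: bool = False


def build_column_kwargs(name: str, field: FieldInfo) -> dict:
    """Translate field options into keyword arguments for a column.

    Raises SchemaInitError if `unique` is set, since the column
    definition in this sketch accepts no `unique` argument.
    """
    if field.unique:
        raise SchemaInitError(
            f"column '{name}': unique is not supported for this backend"
        )
    return {"name": name, "nullable": field.nullable}
```

Note that a guard like this only fires if the `unique` value actually reaches it; if the value arrives as `None`, as reported in the PR description, the check is silently skipped.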
Thanks @cosmicBboy. Let me rephrase things a bit. For 1, with `Config`, it does work as you described. I added a unit test with only one column, but I can add a second one with two columns. For 2, I can add it where you mentioned. I haven't had time to look closely at how the internals work, so I appreciate the direction. My guess is that, since 1 is implemented and works as expected, 2 should not be impossible to implement, but sadly I don't have the bandwidth right now, as I would need extra time to understand the internals of the library. For the DCO, I will try to do that as well when I have time.
@cosmicBboy I updated the PR according to my understanding.
I am not sure why the linter didn't execute here.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 83.16%. Comparing base (4df61da) to head (4fffae2). Report is 75 commits behind head on main.
Thanks @zippeurfou and congrats on your first contribution to pandera! 🚀
Following #1344, I am adding a bit of edited documentation. I wasn't able to raise `SchemaInitError` as @cosmicBboy suggested, because it turns out that if you use `pa.Field(unique=True)` it is seen as None, which is most likely the issue. In the follow-up screenshot you can see the behavior when I added the code where @cosmicBboy suggested. As the ghost text shows, the value is None when I put a breakpoint there, so I did not add it.