tfwillems opened this issue 3 years ago
Hi @tfwillems, interesting question!
In ANSI SQL, NULL values cannot be compared (NULL = NULL evaluates to unknown rather than true). An ANSI-SQL-compliant database will therefore ignore null values when you add a UNIQUE constraint to a column. That's the behavior you expected.
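For concreteness, the same rule can be demonstrated from Python using SQLite from the standard library; like other ANSI-compliant engines, SQLite does not count NULLs as duplicates under a UNIQUE constraint:
import sqlite3
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER UNIQUE)")
# multiple NULLs in a UNIQUE column are accepted
con.executemany("INSERT INTO t (x) VALUES (?)", [(1,), (None,), (None,)])
print(con.execute("SELECT COUNT(*) FROM t").fetchone()[0])
#> 3
# a repeated non-null value, however, is rejected
try:
    con.execute("INSERT INTO t (x) VALUES (1)")
except sqlite3.IntegrityError as exc:
    print(exc)
#> UNIQUE constraint failed: t.x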
However, pandas.Series.duplicated(), which powers pandera's unique, regards null values as equal:
import numpy as np
import pandas as pd
# pandas and numpy treat nulls as SQL does
print(np.nan == np.nan)
#> False
print(pd.NA == pd.NA)
#> <NA>
# but duplicated does not
data = pd.Series([1.0, pd.NA, pd.NA, np.nan, np.nan, None, None])
print(data[data.duplicated()])
#> 2 <NA>
#> 4 NaN
#> 6 None
#> dtype: object
In that sense, pandera follows pandas, as expected.
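To make this concrete, here is a minimal sketch of the current behaviour (the schema and column name are only illustrative):
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({"x": pa.Column(float, nullable=True, unique=True)})
# two nulls in the column: the second is flagged as a duplicate, so validation fails
df = pd.DataFrame({"x": [1.0, None, None]})
try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)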
Should users be allowed to create such fields, or should Field and/or pa.Column emit a ValueError upon instantiation with these values? I did not intuitively expect the behavior above, even though it makes perfect sense from the individual definitions. And I'm not sure a Field that allows exactly one null is very useful.
Field(nullable=True, unique=True) does not mean you can only have a single null value. [1.0, None] satisfies the constraints, so we should allow it. Let me know if I misunderstood your statement.
This could be added via a user's custom check, but I wonder whether there'd be interest in adding a new keyword argument to Field/pa.Column that enforces this (perhaps nonnull_unique)?
I agree that ignoring null values can be desirable. Field already has an ignore_na argument that is passed down to checks. We could apply it to unique as well. We'd also need to add it to Column to keep SchemaModel and DataFrameSchema consistent. There is an ongoing discussion about supporting Field in DataFrameSchema, so adding an argument to Column should probably wait for the outcome of that discussion.
How should users best go about creating a Field that can have multiple nulls but requires that non-null values be unique?
A custom check for duplicates with ignore_na=True will indeed do the trick.
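A minimal sketch of that workaround (the column name, lambda, and error message are illustrative, not part of pandera's built-ins):
import pandas as pd
import pandera as pa
# ignore_na=True (the default for Check) drops null values before the check
# runs, so duplicated() only sees the non-null values
schema = pa.DataFrameSchema({
    "x": pa.Column(
        float,
        nullable=True,
        checks=pa.Check(
            lambda s: ~s.duplicated(keep=False),
            ignore_na=True,
            error="non-null values must be unique",
        ),
    )
})
df = pd.DataFrame({"x": [1.0, None, None, 2.0]})
print(schema.validate(df))  # passes: the only repeated values are null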
Pinging @cosmicBboy to pick his brain about re-using the ignore_na argument.
Yeah, I think it makes sense to have an ignore_na kwarg in Column, which is passed down as the default to all checks in that column.
One question: would the user be able to override this default in the check?
pa.Column(ignore_na=True, checks=[pa.Check(...), pa.Check(..., ignore_na=False)])
+1 to ignoring nulls when ignore_na=True and unique=True. We can make slight deviations from pandas behavior if the deviation makes more intuitive sense from a validation perspective:
if unique:
    # keep=False flags every occurrence of a duplicated value
    duplicated = series.duplicated(keep=False)
    if ignore_na:
        # need to preserve the index, so mark all null values as not duplicated
        duplicated = duplicated.mask(series.isna(), False)
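Illustrating that masking step on a small standalone series (not pandera's actual implementation):
import pandas as pd
series = pd.Series([1.0, 1.0, None, None])
# pandas flags the repeated NaN as a duplicate too
duplicated = series.duplicated(keep=False)
print(duplicated.tolist())
#> [True, True, True, True]
# masking the null positions leaves only the genuine duplicates flagged
print(duplicated.mask(series.isna(), False).tolist())
#> [True, True, False, False]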
Just revisiting this: I think a good solution would be to warn the user if they use unique=True and nullable=True together and fail to explicitly provide a value for ignore_na.
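A rough sketch of what that warning could look like (the helper name and message are purely illustrative, not pandera's API):
import warnings
from typing import Optional
def warn_ambiguous_unique(unique: bool, nullable: bool, ignore_na: Optional[bool]) -> None:
    # warn only when ignore_na was left unset, since that is the ambiguous case
    if unique and nullable and ignore_na is None:
        warnings.warn(
            "unique=True with nullable=True treats null values as duplicates; "
            "set ignore_na explicitly to silence this warning",
            UserWarning,
        )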
Question about pandera
Currently, if a Field is created with unique=True and nullable=True, the data frame will fail validation if more than one null value is present in the column. Although this seems consistent with the definitions of these attributes, my question is whether this should be the case, or whether null values should be treated as special values that are omitted from duplicate checking.
If this is the expected/ideal behavior: