unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera

Should null values be counted as duplicates: Expected behavior for nullable=True and unique=True #644

Open tfwillems opened 3 years ago

tfwillems commented 3 years ago

Question about pandera

Currently, if a Field is created with unique=True and nullable=True, the data frame will fail validation if more than one null value is present in the column. Although this seems consistent with the definitions of these attributes, my question is whether this should be the case, or whether null values should be treated as special values that are omitted from duplicate checking.

import pandas as pd
from pandera import SchemaModel, Field
from pandera.typing import Series

class Model(SchemaModel):
    x: Series[float] = Field(nullable=True, unique=True)

Model.validate(pd.DataFrame({"x": [1.0, None, None]})) # Fails as None is regarded as a duplicate value
# pandera.errors.SchemaError: series 'x' contains duplicate values:

If this is the expected/ideal behavior, I have a few follow-up questions, quoted in the reply below.

jeffzi commented 3 years ago

Hi @tfwillems, interesting question!

In ANSI SQL, NULL values cannot be compared (NULL = NULL evaluates to unknown, not true). Therefore, an ANSI-SQL-compliant database ignores null values when a UNIQUE constraint is added to a column. That's the behavior you expected.
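For illustration, SQLite follows the same convention, so Python's built-in sqlite3 module can demonstrate it:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x REAL UNIQUE)")

# Multiple NULLs do not violate UNIQUE, because NULL is never equal to NULL
con.execute("INSERT INTO t (x) VALUES (1.0), (NULL), (NULL)")

# A duplicate non-null value does violate it
try:
    con.execute("INSERT INTO t (x) VALUES (1.0)")
except sqlite3.IntegrityError as err:
    print(err)
#> UNIQUE constraint failed: t.x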

However, pandas.Series.duplicated(), which powers pandera's unique, regards null values as equal:

import numpy as np
import pandas as pd

# pandas and numpy treat nulls as SQL does
print(np.nan == np.nan)
#> False
print(pd.NA == pd.NA)
#> <NA>

# but duplicated does not
data = pd.Series([1.0, pd.NA, pd.NA, np.nan, np.nan, None, None])
print(data[data.duplicated()])
#> 2    <NA>
#> 4     NaN
#> 6    None
#> dtype: object

In that sense, pandera follows pandas, as expected.

> Should users be allowed to create such fields, or should Field and/or pa.Column emit a ValueError upon instantiation with these values? I did not expect the behavior above intuitively, even though it makes perfect sense from their individual definitions. And I'm not sure if a Field with exactly 1 null is very useful.

Field(nullable=True, unique=True) does not mean you can only have a single null value. [1.0, None] satisfies the constraints, so we should allow it. Let me know if I misunderstood your statement.
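For example, reusing the Model defined above:

Model.validate(pd.DataFrame({"x": [1.0, None]}))  # passes: a single null is not a duplicate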

> This could be added via a user's custom check, but I wonder whether there'd be interest in adding a new keyword argument to Field/pa.Column that enforces this (perhaps nonnull_unique)?

I agree that ignoring null values can be desirable. Field already has an ignore_na argument that is passed down to checks. We could apply it to unique as well. We'd also need to add it to Column to keep feature parity between SchemaModel and DataFrameSchema. There is ongoing discussion about supporting Field in DataFrameSchema, so adding an argument to Column should probably wait for the outcome of that discussion.

> How should users best go about creating a Field that can have multiple nulls but requires that non-null values be unique?

A custom check for duplicates with ignore_na=True will indeed do the trick.
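Something along these lines should work (a sketch: ignore_na=True drops nulls before the check function runs, so only non-null duplicates are flagged; the check name here is illustrative):

import pandas as pd
import pandera as pa
from pandera.typing import Series

class Model(pa.SchemaModel):
    x: Series[float] = pa.Field(nullable=True)

    @pa.check("x", ignore_na=True)
    def unique_non_null(cls, series: Series[float]) -> Series[bool]:
        # nulls were already dropped here thanks to ignore_na=True
        return ~series.duplicated()

Model.validate(pd.DataFrame({"x": [1.0, None, None]}))  # passes
Model.validate(pd.DataFrame({"x": [1.0, 1.0, None]}))   # raises SchemaError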

Pinging @cosmicBboy to pick his brain about re-using the ignore_na argument.

cosmicBboy commented 3 years ago

yeah, I think it makes sense to have an ignore_na kwarg in Column, which is passed down as the default to all checks in that column.

One question: would the user be able to override this default in the check?

pa.Column(ignore_na=True, checks=[pa.Check(...), pa.Check(..., ignore_na=False)])

+1 to ignoring nulls when ignore_na=True and unique=True; we can make slight deviations from pandas behavior if the deviation makes more intuitive sense from a validation perspective:

if unique:
    # keep=False flags every occurrence of a duplicated value
    # (pandas' keep parameter accepts "first", "last", or False -- not True)
    duplicated = series.duplicated(keep=False)
    if ignore_na:
        # need to preserve the index, so mark all null values as not duplicated
        duplicated = duplicated.mask(series.isna(), False)

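To illustrate the intent of that sketch on a concrete series:

import numpy as np
import pandas as pd

series = pd.Series([1.0, 1.0, np.nan, np.nan])
duplicated = series.duplicated(keep=False)
print(duplicated.tolist())
#> [True, True, True, True]

# masking the nulls leaves only the non-null duplicates flagged
duplicated = duplicated.mask(series.isna(), False)
print(duplicated.tolist())
#> [True, True, False, False]
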
nathanjmcdougall commented 1 year ago

Just revisiting this: I think a good solution would be to warn the user if they use unique=True and nullable=True together without explicitly providing a value for ignore_na.
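A rough sketch of that warning (the helper name and where it would hook in are purely illustrative, not pandera's actual API):

import warnings
from typing import Optional

def _warn_ambiguous_uniqueness(
    nullable: bool, unique: bool, ignore_na: Optional[bool]
) -> None:
    # hypothetical hook, called when a Field/Column is constructed
    if nullable and unique and ignore_na is None:
        warnings.warn(
            "nullable=True and unique=True without an explicit ignore_na: "
            "null values will be counted as duplicates",
            UserWarning,
        )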