multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0

ignore_nan for distinctvalues #37

Closed. Maarten-vd-Sande closed this 4 years ago

Maarten-vd-Sande commented 4 years ago

Thanks for the great tool! :wave:

Small & self-explanatory PR.

I added an argument to IsDistinctValidation called ignore_nan which, when set to True, leaves NaN values out when checking for duplicates.

Let me know if it needs extra work somewhere.
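
To make the proposed behaviour concrete, here is a minimal, library-free sketch of the duplicate check described above. It is not the code from this PR: the helper names `is_nan` and `duplicate_positions` are hypothetical, and it assumes NaN entries are treated as equal to one another unless `ignore_nan` is set.

```python
import math


def is_nan(value):
    """Return True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def duplicate_positions(values, ignore_nan=False):
    """Return the positions of values that repeat an earlier entry.

    Assumption for this sketch: NaN entries count as equal to one another,
    so a second NaN is a duplicate unless ignore_nan is True, in which case
    NaN entries are skipped entirely.
    """
    seen = set()
    seen_nan = False
    duplicates = []
    for position, value in enumerate(values):
        if is_nan(value):
            if ignore_nan:
                continue  # NaN entries take no part in the check
            if seen_nan:
                duplicates.append(position)
            seen_nan = True
        elif value in seen:
            duplicates.append(position)
        else:
            seen.add(value)
    return duplicates


# Two missing values followed by two distinct numbers
column = [float("nan"), float("nan"), 4.0, 2.0]
print(duplicate_positions(column))                   # [1]
print(duplicate_positions(column, ignore_nan=True))  # []
```

With the flag off, the second NaN is reported as a duplicate of the first; with it on, only repeated real values are reported.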

multimeric commented 4 years ago

Hi, can you first please check if allow_empty solves this use-case?

Maarten-vd-Sande commented 4 years ago

That seems to work. I had seen #13, but somehow I thought it didn't work for me. Now that I check again, it works just fine.

Thanks for your fast reply!

proof:

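import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import IsDistinctValidation

# Dict-of-dicts input: rows missing from a column become NaN
df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3},
                   'C2': {'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

print(df)

# allow_empty=True: empty (NaN) cells are treated as valid
schema = Schema([
    Column('C1', [IsDistinctValidation()], allow_empty=True),
    Column('C2', [IsDistinctValidation()], allow_empty=True),
    Column('C3', [IsDistinctValidation()], allow_empty=True)
])

for error in schema.validate(df):
    print(error)

Output:

    C1   C2  C3
A  5.0  NaN   3
B  2.0  NaN   4
C  3.0  4.0   6
D  NaN  2.0   8

[no errors printed]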