sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
Docs: https://go.soda.io/core-docs
License: Apache License 2.0

duplicate_count semantics are unclear and surprising #1949

Open jkleint opened 1 year ago

jkleint commented 1 year ago

First, thanks for a great tool -- I really appreciate Soda's focus on simplicity and readability.

According to the documentation, duplicate_count() counts "The number of rows that contain duplicate values." ^1 However, what it actually seems to do is count the number of unique values that have any duplicates (the generated SQL groups by the column and counts the values that occur more than once). These are not the same. As a newcomer, I found this a source of some confusion and frustration, since it is at odds with the usual definition of "duplicate" in many data frameworks.
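To make the two counting rules concrete, here is a small plain-Python sketch (my own illustration, not Soda's implementation):

from collections import Counter

values = ['a', 'a', 'b', 'b', 'b']
counts = Counter(values)

# What duplicate_count() appears to report: distinct values that occur more than once.
values_with_duplicates = sum(1 for n in counts.values() if n > 1)   # 2 ('a' and 'b')

# The usual row-level definition (what pandas .duplicated() flags): rows beyond
# the first occurrence of each value.
duplicate_rows = sum(n - 1 for n in counts.values() if n > 1)       # 3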

For instance, Pandas considers the following dataframe to have three duplicate rows:

import pandas as pd

# Five rows, two distinct values: 'a' appears twice and 'b' three times.
dupes = pd.DataFrame({'dupes': ['a', 'a', 'b', 'b', 'b']})
dupes.duplicated()
0    False
1     True
2    False
3     True
4     True

Whereas Soda counts two:

from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("dask")
scan.add_sodacl_yaml_str("""
    checks for duplicates:
        - duplicate_count(dupes) = 3
    """)
# Register the dataframe as the "duplicates" dataset referenced in the checks.
scan.add_pandas_dataframe("duplicates", dupes)

scan.execute()
print(scan.get_logs_text())
INFO   | Soda Core 3.0.51
INFO   | Using DefaultSampler
INFO   | Scan summary:
INFO   | 1/1 check FAILED: 
INFO   |     duplicates in dask
INFO   |       duplicate_count(dupes) = 3 [FAILED]
INFO   |         check_value: 2
INFO   | Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
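For reference, a user-defined check can approximate the conventional row-level count today. The sketch below is only an assumption on my part: the metric name duplicate_rows is made up, the query syntax follows the SodaCL user-defined checks docs, it assumes the dask/pandas data source accepts user-defined queries, and COUNT(*) - COUNT(DISTINCT dupes) only equals the row-level count when the column has no NULLs:

scan = Scan()
scan.set_data_source_name("dask")
scan.add_sodacl_yaml_str("""
    checks for duplicates:
        - duplicate_rows = 3:
            duplicate_rows query: |
                SELECT COUNT(*) - COUNT(DISTINCT dupes) FROM duplicates
    """)
scan.add_pandas_dataframe("duplicates", dupes)
scan.execute()

On the example data this evaluates to 5 rows minus 2 distinct values = 3, matching the pandas count above.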

May I kindly suggest clarifying the documentation to reflect the actual behavior, warning that it is at odds with the standard definition, and showing examples that make the behavior clear? You might also consider adding a new metric with the more conventional behavior.

Thank you!

tools-soda commented 1 year ago

SAS-2362

michal-porebski commented 1 month ago

+1 this.

I had to create custom duplicate checks for this exact issue. The duplicate_count() check returns the number of unique values that have duplicates.

For instance, if a column contains the same value across all rows, the built-in duplicate_count() will return 1, indicating “one value has duplicates.” However, based on the documentation, you might expect it to return the total number of rows, since they all contain the same duplicated value.

The documentation specifies that it should return "The number of rows that contain duplicate values," which is misleading in this context.
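To illustrate the all-identical case (my own example, not from the Soda docs), here is how the different readings of "duplicate" diverge; the last line matches the duplicate_count() behavior described above:

import pandas as pd

same = pd.DataFrame({'col': ['x', 'x', 'x', 'x']})

# Rows that contain a duplicated value, per the documentation's wording:
same['col'].duplicated(keep=False).sum()     # 4 (every row)

# Rows beyond the first occurrence, the usual pandas-style count:
same['col'].duplicated().sum()               # 3

# Distinct values that have any duplicate, which is what duplicate_count() reports:
(same['col'].value_counts() > 1).sum()       # 1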