Open jkleint opened 1 year ago
SAS-2362
+1 this.
I had to create custom duplicate checks for this exact issue. The duplicate_count()
check returns the number of unique values that have duplicates.
For instance, if a column contains the same value across all rows, the built-in duplicate_count()
will return 1, indicating “one value has duplicates.” However, based on the documentation, you might expect it to return the total number of rows, since they all contain the same duplicated value.
The documentation specifies that it should return "The number of rows that contain duplicate values.", which is misleading in this context.
First, thanks for a great tool -- I really appreciate Soda's focus on simplicity and readability.
According to the documentation,
duplicate_count()
counts "The number of rows that contain duplicate values." ^1 However, what it seems to actually do is count the number of unique values that have any duplicates. (The generated SQL is doing a group by and counting values that occur more than once). These are not the same. As a newcomer, this was a source of some confusion and frustration, since it is at odds with the usual definition of "duplicate" in many data frameworks.For instance, Pandas considers the following dataframe to have three duplicate rows:
Whereas Soda counts two:
May I kindly suggest clarifying the documentation to reflect the actual behavior, warning that it is at odds with the standard definition, and showing examples to make it clear. You might also consider adding a new metric with the more conventional behavior.
Thank you!