great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
10.02k stars 1.55k forks source link

expect_column_values_to_be_unique is not clearly a column_map_expectation #291

Closed jpcampb2 closed 6 years ago

jpcampb2 commented 6 years ago

Currently, pandas implements expect_column_values_to_be_unique as a column_map_expectation, but as observed in #290, it should be a column_aggregate_expectation since uniqueness isn't a per-row property.

jpcampb2 commented 6 years ago

I shouldn't be so definitive, I think: the current expect_column_values_to_be_unique expectation is tricky because it is using the column_map_expectation semantics in our tests...but as I initially mentioned (now below) it isn't exactly a column_map_expectation--it would require re-joining against the counts to provide the same level of return value in the sqlalchemy context.

For now, I recommended we instead implement the proportion_of_unique_values_to_be_between expectation in sqlalchemy.

abegong commented 6 years ago

As I see it, expect_column_proportion_of_unique_values_to_be_between is the aggregate expectation and expect_column_values_to_be_unique is the map expectation. They do similar things, but I think it makes sense to have one of each.

Think about the output returned in result_object: sometimes you'll want to know row-by-row which values are duplicated, get counts of duplicate values, etc. This is exactly what a column map expectation does.

In other cases, you'll want to know the overall proportion of duplicated values, without bothering about row-by-row details. That's a use case for an aggregate expectation.

Bottom line: I think the way these expectations are currently implemented is correct.

jpcampb2 commented 6 years ago

Capturing today's discussion:

We'll plan to leave the expectations like this, with overlapping intents. We'll implement an (admittedly more expensive) sqlalchemy version of expect_column_values_to_be_unique eventually, and guide the current PR toward instead being an implementation of the column_aggregate_expectation.

lamirik commented 5 years ago

expect_column_proportion_of_unique_values_to_be_between is actually calculating the proportion of the unique values which are present within that column. What I actually understand from the name of the expectation is that it could measure the proportion of one label in contrast to other labels. This would make sense in a data validation context ( Like validating an imbalanced dataset : The proportion of 1s and 0s within the target variable ) .

abegong commented 5 years ago

@lamirik This sounds interesting---could provide the basis for a new Expectation.

Are you imagining something like expect_percent_of_specific_values_to_be_between?

If column X contains [1,1,1,1,1,1,2,2,3,3], then expect_percent_of_specific_values_to_be_between("X", specific_value=1, min_value=0.45, max_value=0.55) returns True because the actual percent is 50% ( = 0.5).

?

lamirik commented 5 years ago

yes that's what I am working with actually. I made my own expectation with the specific aggregation decorator which gives me the percentage of a specific label in contrast to another label. In my case I am actually working with binary columns and would extend it to the percentage of all possible labels in contrast to the total count.. :)

abegong commented 5 years ago

Would you be interested in submitting the new expectation as a PR? You're not the only one who's asked for something like this...

jpcampb2 commented 5 years ago

We recently got another request for a feature very much like this (please check out the slack channel too, BTW)--effectively, it's a nonstatistical distributional test. I'd like to see it support multiple values in a single definition:

expect_column_value_proportions_to_be_between("mycol", [{val: 1, min: 0.2, max:0.3}, {val:2, min:0.2, max:0.7}])

On Mon, Jun 3, 2019 at 11:59 AM Abe Gong notifications@github.com wrote:

Would you be interested in submitting the new expectation as a PR? You're not the only one who's asked for something like this...

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/great-expectations/great_expectations/issues/291?email_source=notifications&email_token=AFLQ65TQ77RLQOETB65VV4DPYU5WHA5CNFSM4FAJCWM2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ32HY#issuecomment-498318623, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLQ65QBPASSG3PDIQQIERDPYU5WHANCNFSM4FAJCWMQ .