I shouldn't be so definitive, I think: the current expect_column_values_to_be_unique expectation is tricky because it uses the column_map_expectation semantics in our tests...but as I initially mentioned (now below), it isn't exactly a column_map_expectation: it would require re-joining against the counts to provide the same level of detail in the return value in the sqlalchemy context.
For now, I recommended we instead implement the proportion_of_unique_values_to_be_between expectation in sqlalchemy.
As I see it, expect_column_proportion_of_unique_values_to_be_between is the aggregate expectation and expect_column_values_to_be_unique is the map expectation. They do similar things, but I think it makes sense to have one of each.
Think about the output returned in result_object: sometimes you'll want to know row-by-row which values are duplicated, get counts of duplicate values, etc. This is exactly what a column map expectation does.
In other cases, you'll want to know the overall proportion of duplicated values, without bothering about row-by-row details. That's a use case for an aggregate expectation.
Bottom line: I think the way these expectations are currently implemented is correct.
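To make that contrast concrete, here is a minimal pandas-only sketch of the two result shapes (just an illustration, not the actual great_expectations decorators):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2, 3, 3, 4]})

# Map-style view: a per-row boolean, so the result can say exactly which
# rows hold duplicated values and how often each duplicate appears.
row_is_unique = ~df["x"].duplicated(keep=False)
duplicate_counts = df.loc[~row_is_unique, "x"].value_counts()

# Aggregate-style view: a single number for the whole column, the
# proportion of distinct values among the non-null values.
proportion_unique = df["x"].nunique() / df["x"].notnull().sum()

print(row_is_unique.tolist())  # [False, False, True, False, False, True]
print(proportion_unique)       # 4 distinct / 6 non-null values, roughly 0.667
```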
Capturing today's discussion:
We'll plan to leave the expectations like this, with overlapping intents. We'll eventually implement an (admittedly more expensive) sqlalchemy version of expect_column_values_to_be_unique, and guide the current PR toward being an implementation of the column_aggregate_expectation instead.
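For context on the extra cost, here is a rough sketch of the shape such a sqlalchemy implementation might take (my assumption, not the eventual implementation; the table and column names are made up, and it uses SQLAlchemy 2.x-style select()):

```python
import sqlalchemy as sa

metadata = sa.MetaData()
t = sa.Table("my_table", metadata, sa.Column("x", sa.String))

# Aggregate version: a single scan producing one scalar for the column.
unique_proportion = sa.select(
    sa.cast(sa.func.count(sa.distinct(t.c.x)), sa.Float) / sa.func.count(t.c.x)
)

# Map version: the per-value counts have to be joined back onto every row
# so each row can be flagged as unexpected; this re-join is what makes it
# more expensive.
counts = (
    sa.select(t.c.x.label("val"), sa.func.count().label("n"))
    .group_by(t.c.x)
    .subquery()
)
row_level = sa.select(t.c.x, (counts.c.n > 1).label("is_duplicate")).select_from(
    t.join(counts, t.c.x == counts.c.val)
)
```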
expect_column_proportion_of_unique_values_to_be_between actually calculates the proportion of unique values present within that column. What I understood from the name of the expectation is that it could measure the proportion of one label in contrast to other labels. This would make sense in a data validation context (like validating an imbalanced dataset: the proportion of 1s and 0s within the target variable).
@lamirik This sounds interesting; it could provide the basis for a new Expectation.
Are you imagining something like expect_percent_of_specific_values_to_be_between? For example: if column X contains [1,1,1,1,1,2,2,2,3,3], then expect_percent_of_specific_values_to_be_between("X", specific_value=1, min_value=0.45, max_value=0.55) would return True, because the actual percent is 50% (= 0.5). Is that right?
Yes, that's what I am working with, actually. I made my own expectation with the aggregation decorator, which gives me the percentage of a specific label in contrast to another label. In my case I am working with binary columns, and I would extend it to the percentage of all possible labels relative to the total count. :)
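For reference, the underlying computation for the binary case is tiny; here is a plain-pandas sketch (not the actual decorator-based expectation, and the helper name is made up):

```python
import pandas as pd

def percent_of_value(series: pd.Series, value) -> float:
    """Share of non-null rows that equal `value`, e.g. the 1s in a binary target."""
    non_null = series.dropna()
    return float((non_null == value).mean())

target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, None])
print(percent_of_value(target, 1))  # 5 of 8 non-null values -> 0.625
```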
Would you be interested in submitting the new expectation as a PR? You're not the only one who's asked for something like this...
We recently got another request for a feature very much like this (please check out the Slack channel too, BTW). Effectively, it's a non-statistical distributional test. I'd like to see it support multiple values in a single definition:
expect_column_value_proportions_to_be_between("mycol", [{"val": 1, "min": 0.2, "max": 0.3}, {"val": 2, "min": 0.2, "max": 0.7}])
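A rough pandas sketch of those semantics (the function name and the exact constraint-dict keys here are illustrative, not an agreed API):

```python
import pandas as pd

def check_value_proportions(series: pd.Series, constraints) -> bool:
    # Proportion of each value among the non-null rows.
    proportions = series.dropna().value_counts(normalize=True)
    return all(
        c["min"] <= proportions.get(c["val"], 0.0) <= c["max"]
        for c in constraints
    )

col = pd.Series([1, 1, 1, 2, 2, 3, 3, 3, 3, 3])
print(check_value_proportions(col, [
    {"val": 1, "min": 0.2, "max": 0.3},
    {"val": 2, "min": 0.2, "max": 0.7},
]))  # True: observed proportions are 1 -> 0.3 and 2 -> 0.2
```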
Currently, pandas implements expect_column_values_to_be_unique as a column_map_expectation, but as observed in #290, it should be a column_aggregate_expectation since uniqueness isn't a per-row property.