great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.91k stars 1.53k forks source link

expect_column_values_to_be_in_set allows None objects to pass #412

Closed pybokeh closed 4 years ago

pybokeh commented 5 years ago

I have a dataframe column called "ModelName" and it contains Python None object.

So I did not expect the following test to pass:

if ge_df.expect_column_values_to_be_in_set(column="ModelName", value_set=['ACCORD','CIVIC','CROSSTOUR','CRV','CSX','EL',
                                                                          'ELEMENT','FIT','HRV','ILX','INSIGHT','MDX','NSX',
                                                                          'ODYSSEY','PASSPORT','PILOT','RDX','RIDGELINE','TL','TLX','ZDX'
                                                                         ])['success']:
    print('Passed Model Name Check')
else:
    print('FAILED Model Name Check')

Output: Passed Model Name Check

I see there is another expectation to check for Null values and so this check worked as expected:

if ge_df.expect_column_values_to_not_be_null(column="ModelName")['success']:
    print('No model names are null')
else:
    print('Null model names found')

Output: Null model names found

So my question is, shouldn't the first test have failed? It appears the expect_column_values_to_be_in_set expectation does not account for null or None objects. It would be preferable to not have to use 2 expectations to account for null or None objects.

jcampbell commented 5 years ago

Currently, the behavior you're observing is the expected behavior. We have talked about options for potentially allowing more control over when to ignore values, but haven't settled on an API.

That said, the expectations result does return the number and percentage of missing values that were ignored in evaluating the expectation. So, in your case, you could get the behavior you're looking for without two expectations by checking both the 'success' value and the number/percentage of missing values.

pybokeh commented 5 years ago

@jcampbell Hmmm, interesting. With the understanding of what a "set" mathematically implies, I would have expect that the provided values in the set would be explicitly exclusive that I want to test the column against. Therefore, it would not be obvious or intuitive to me to also have to check for the number/percentage of missing values. Would be interesting to see what other users think about this current behavior since it is possible that my viewpoint on the semantics of what a "set" should imply may be in the minority.

jcampbell commented 5 years ago

I agree that this is a tricky semantic question. One thing to remember is that the missing/ignored values behavior that you're observing with respect to set membership is common to all expectation evaluation logic (so, for example, None values do not cause expectations about values being in a specific range to fail either).

jcampbell commented 5 years ago

@pybokeh : Just wanted to check in on this (my mind is on missing values after a PR this week that optimized computation for expect_column_values_to_(not)_be_null: did you end up keeping two expectations, and/or do you think the right solution is to change the current behavior?

One option could be to add a parameter to all expectations that defines what values count as missing. So, the definition of expect_column_values_to_be_in_set would change to:

    def expect_column_values_to_be_in_set(self,
                                          column,
                                          value_set,
                                          mostly=None,
                                          {null/missing/skip}_value_list=[None],
                                          parse_strings_as_datetimes=None,
                                          result_format=None, include_config=False, catch_exceptions=None, 
                                          meta=None
                                          ):

That would keep the default behavior, but specifying null_value_list=[] would give the behavior you described.

pybokeh commented 5 years ago

"I agree that this is a tricky semantic question. One thing to remember is that the missing/ignored values behavior that you're observing with respect to set membership is common to all expectation evaluation logic (so, for example, None values do not cause expectations about values being in a specific range to fail either)."

@jcampbell To be honest, I would prefer the current behavior to change at least for the expect_column_values_to_be_in_set() function. For other expectations, I think it is ok that missing/Null/None values are handled as is currently (sorry I haven't fully studied all of them) or with the proposed additional parameter. I think a "set" is pretty explicit on what a user would expect (sorry for the unintended pun), that is, the column should solely contain exactly, and only, the provided values. Anything else, the test should fail.

As for your proposed API change to include an additional parameter, that looks good to me. But that would require me or the user to know beforehand what the potential missing values would be (null vs None) in the data which may not always be possible. Also, your proposed change would imply that for me to capture both None and null values, I have have to specify 2 additional parameters? Is that correct? Like so:

expect_column_values_to_be_in_set(self,
                                          column,
                                          value_set,
                                          mostly=None,
                                          null_value_list=[],
                                          missing_value_list=[],
                                          parse_strings_as_datetimes=None,
                                          result_format=None, include_config=False, catch_exceptions=None, 
                                          meta=None
                                          ):

Sorry if I mis-understood your proposal. Your {null/missing/skip}_value_list parameter is throwing me off. If you could provide a minimal example of this proposed changed, that would be helpful for me.

Regardless, as a "fail-safe", I would make at least expect_column_values_to_be_in_set() function fail when there are values that were not in the set of values provided by the user.

jcampbell commented 5 years ago

My notation wasn't clear--I was just suggesting that the name of the single parameter should be up for discussion. A minimal example would be:

expect_column_values_to_be_in_set(self,
                                          column,
                                          value_set,
                                          mostly=None,
                                          missing_value_list=[],
                                          parse_strings_as_datetimes=None,
                                          result_format=None, include_config=False, catch_exceptions=None, 
                                          meta=None
                                          ):

I definitely think I understand your perspective and it makes sense to me. Since it would be a significant change in the behavior without fundamentally providing more information (because the current result object does include missing counts as well), I favor the approach of exposing a parameter so that it's clear that there is a path to keeping the old behavior even as we change the default behavior.

pybokeh commented 5 years ago

@jcampbell Sounds reasonable to me!

stale[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.