great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Groupby Utility Function #402

Closed anhollis closed 4 years ago

anhollis commented 5 years ago

Below is a potential solution to the recurring problem of multi-column expectations and of running expectations on grouped data. In this solution, users specify an object that can be used to group the data, such as a column or a set of keys. The user also specifies the data set on which to build the expectations and the expectations to run, as a dictionary mapping each expectation name to its arguments. The return value is a dictionary with one entry per group. Each entry is itself a dictionary holding the result of every expectation that was run on that group.
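To make the shape of that return value concrete, here is a minimal sketch of the nested dictionary and one way to pull out the failing groups. The helper name `failing_expectations` and the sample values are hypothetical, and the sketch assumes each expectation result is a dict with a boolean "success" key.

```python
# Hypothetical sample of the nested structure:
# {group level: {expectation name: result}}.
# Assumes each result is a dict with a boolean "success" key; values are made up.
sample_results = {
    "male": {
        "expect_table_row_count_to_be_between": {"success": False},
        "expect_column_median_to_be_between": {"success": True},
    },
    "female": {
        "expect_table_row_count_to_be_between": {"success": False},
        "expect_column_median_to_be_between": {"success": True},
    },
}


def failing_expectations(group_results):
    """Return {group level: [names of expectations that did not succeed]}."""
    failures = {}
    for level, results in group_results.items():
        failed = [name for name, result in results.items() if not result["success"]]
        if failed:
            failures[level] = failed
    return failures


print(failing_expectations(sample_results))
```

Groups in which every expectation succeeded are left out of the summary, so an empty dictionary means everything passed.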

I think this is different from the pre-processing approach (#294) in that it would allow the user to specify multiple group expectations to examine simultaneously. Some other related issues are #351, #373, and #236.

It is possible that this kind of function is trying to solve too specific a problem. We might prefer a function that addresses something more general than simply grouping data and running expectations. We might want to avoid specific utility functions altogether, but I still thought it might be worth considering.

Below is a full reproducible example of how this solution would work for a pandas data set. The concept can be easily extended to SQL, XML, and JSON data; only the way the data is grouped would change, and the rest of the process would be similar in each case. We would likely need to write a separate group_by function for each type of data set we want to support.
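As a rough illustration of how the same idea might carry over to JSON-style data, the sketch below groups a list of dict records by one key and runs a set of checks on each group. The names `group_records` and `json_groupby` are hypothetical, the records are made up, and the checks are plain callables standing in for expectations rather than anything from the Great Expectations API.

```python
from collections import defaultdict


def group_records(records, group_key):
    """Split a list of dict records into {group level: [records]} by one key."""
    groups = defaultdict(list)
    for record in records:
        # Records missing the key are collected under None rather than dropped.
        groups[record.get(group_key)].append(record)
    return dict(groups)


def json_groupby(records, checks, group_key):
    """Run every check on every group.

    `checks` maps a name to a callable that takes a list of records
    and returns a result.
    """
    return {
        level: {name: check(group) for name, check in checks.items()}
        for level, group in group_records(records, group_key).items()
    }


# Made-up records for illustration only.
records = [
    {"Sex": "male", "Age": 22},
    {"Sex": "female", "Age": 38},
    {"Sex": "male", "Age": 35},
]
checks = {"row_count_at_least_2": lambda rows: {"success": len(rows) >= 2}}
results = json_groupby(records, checks, "Sex")
```

Only `group_records` is specific to the data format here; the loop that runs the checks does not care how the groups were produced.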

#Demo Of Potential Group By function
import great_expectations as ge #Import Great Expectations
import pandas as pd #Import Pandas
titanic_dat=pd.read_csv("./tests/test_sets/titanic.csv") #Read in Titanic data

group_column_name="Sex" #Specify the sex variable in the Titanic data to use as the grouping variable

main_dat_object=titanic_dat #Specify the Titanic data as the data to run expectations on

expectations={"expect_table_row_count_to_be_between":{"min_value":5,"max_value":50},
              "expect_column_median_to_be_between":{"column":"Age","min_value":10,"max_value":60}} #Specify the expectations and their arguments

def pandas_groupby(main_dat_object,expectations,group_column_name): #group_column_name is required; a None default would fail on the column lookup below

    group_column=main_dat_object[group_column_name] #Extract the column to use for grouping

    group_levels=group_column.unique() #Extract the levels of the grouping variable. The grouping variable is assumed to be categorical

    group_results={} #Set up a null dictionary to save the expectation results for different groups

    for level in group_levels: #Loop over the group levels

        group_data=main_dat_object[group_column==level] #Subset the data by the current group level

        group_pandas=ge.dataset.PandasDataTable(group_data) #Build a ge PandasDataSet object from the subsetted data

        expectations_results={} #Create a dictionary to hold the results for the different expectations

        for expectation in expectations: #Loop over the expectations

            expectations_results[expectation]=getattr(group_pandas,
                               expectation)(**expectations[expectation]) #Run the current expectation and store the results in expectations_results

        group_results[level]=expectations_results #Store the results of the group specific expectations in group_results

    return group_results #Return group_results

expectation_results=pandas_groupby(main_dat_object,expectations,group_column_name=group_column_name)

stale[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

krlng commented 4 years ago

I am also really missing such a feature. Sadly, the solution provided above is outdated. Is there a workaround for the current version (0.9.7)? Are there plans to include such group/filter functionality for expectations in the future?

krlng commented 4 years ago

Okay, the above solution works when using great_expectations.dataset.pandas_dataset.PandasDataset instead of ge.dataset.PandasDataTable.
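A minimal sketch of that variation follows. It takes the wrapping class as a parameter, so the same loop can be used with PandasDataset or any other object that exposes expectation methods. The function name `run_grouped` and the `ListDataset` stand-in are hypothetical; the stand-in exists only so the sketch runs without pandas or Great Expectations installed.

```python
def run_grouped(groups, wrap, expectations):
    """Run expectations on each group.

    groups: iterable of (level, data) pairs, such as the pairs produced by
        iterating over a pandas DataFrame.groupby(...) object.
    wrap: callable that turns raw group data into an object exposing
        expectation methods (for example, PandasDataset).
    expectations: {expectation method name: keyword arguments}.
    """
    results = {}
    for level, data in groups:
        dataset = wrap(data)
        results[level] = {
            name: getattr(dataset, name)(**kwargs)
            for name, kwargs in expectations.items()
        }
    return results


# Intended use with pandas and Great Expectations installed (not run here):
#   from great_expectations.dataset.pandas_dataset import PandasDataset
#   results = run_grouped(titanic_dat.groupby("Sex"), PandasDataset, expectations)


class ListDataset:
    """Hypothetical stand-in for a dataset class; wraps a plain list of rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def expect_table_row_count_to_be_between(self, min_value, max_value):
        return {"success": min_value <= len(self.rows) <= max_value}


demo = run_grouped(
    [("a", [1, 2, 3]), ("b", [4])],
    ListDataset,
    {"expect_table_row_count_to_be_between": {"min_value": 2, "max_value": 5}},
)
```

Passing the class in as `wrap` keeps the grouping loop independent of which dataset class the installed library version provides.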

Still, I don't know how to include this in a suite. I thought of creating a custom expectation, but either it would need to be implemented for each expectation_type that should be groupable, or it would need to wrap other expectations. IMHO, a much better way would be an additional argument, available on all expectations of a backend, that lets the expectation run on only a subset of the batch, based on a filter defined in that argument.
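One way to picture the filter-argument idea is a wrapper that adds an optional filter to an existing expectation-like function. This is only a sketch under stated assumptions: `with_row_filter`, the `row_filter` argument, and the stand-in expectation are all hypothetical and operate on a plain list of dicts, not on a Great Expectations batch.

```python
import functools


def with_row_filter(expectation):
    """Wrap an expectation-like function so it accepts an optional
    `row_filter` predicate and only sees the rows that pass it."""

    @functools.wraps(expectation)
    def wrapper(rows, *args, row_filter=None, **kwargs):
        if row_filter is not None:
            rows = [row for row in rows if row_filter(row)]
        return expectation(rows, *args, **kwargs)

    return wrapper


@with_row_filter
def expect_row_count_to_be_between(rows, min_value, max_value):
    # Hypothetical stand-in for a real expectation; works on a list of dicts.
    return {"success": min_value <= len(rows) <= max_value}


# Made-up rows for illustration only.
rows = [{"Sex": "male"}, {"Sex": "female"}, {"Sex": "male"}]
all_rows = expect_row_count_to_be_between(rows, min_value=3, max_value=3)
males_only = expect_row_count_to_be_between(
    rows, min_value=3, max_value=3, row_filter=lambda r: r["Sex"] == "male"
)
```

Because the filter is applied in one wrapper, each expectation would not need its own group-aware implementation, which is the drawback of the custom-expectation route described above.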

github-actions[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.