tl-its-umich-edu / unizin-validation

Unizin Validation Scripts

Add checks that raise colored flags when conditions are not met (#8) #9

Closed · ssciolla closed this 4 years ago

ssciolla commented 4 years ago

This PR adds new features to both validate.py and dbqueries.py that determine the proper exit code and format the plain-text output based on the results of checks performed on query results. In addition, a test module with two tests was added to ensure concerning results are flagged and labeled accordingly. More details are provided below. The PR aims to resolve issue #8.

1) Our previous discussion about the validation work suggested that we wanted both to make the approach extensible and to differentiate between results that are merely concerning and those that should cause us to cancel other scheduled processes. With that in mind, I decided to embed the details about checks in dbqueries.py. Within each query dictionary, I added a checks dictionary containing color, condition, and rows_to_ignore keys. color is designed to have as its value either YELLOW or RED, depending on whether a check failure should only trigger a warning or should stop a process, respectively. condition should have as its value a lambda function that returns True if the check passes and False otherwise. rows_to_ignore should be a list of indexes for rows not relevant to the check (e.g., schemaversion for the less_than_two_days check).
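For illustration, a query entry with an embedded checks dictionary might look something like the following. The query text, the condition, and the ignored row index are hypothetical examples, not the actual contents of dbqueries.py:

```python
# Hypothetical sketch of a query entry in dbqueries.py with an embedded
# checks dictionary; all concrete values here are illustrative guesses.
QUERIES = {
    "unizin_metadata": {
        "query": "SELECT key, value FROM unizin_metadata;",
        "checks": {
            "less_than_two_days": {
                # YELLOW -> only trigger a warning; RED -> stop a process
                "color": "YELLOW",
                # Returns True when the check passes, False otherwise
                "condition": lambda value: float(value) < 2.0,
                # Row indexes not relevant to this check (e.g., schemaversion)
                "rows_to_ignore": [0],
            }
        },
    }
}
```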

2) I then refactored existing code in validate.py to create two new functions: run_checks_on_output and generate_results_text. I thought it would make the code cleaner to separate these steps, but I will admit the decision was somewhat arbitrary and may result in code duplication or unnecessary for loops; combining them is certainly an option. run_checks_on_output takes the output of a query and, for each check defined in dbqueries.py, creates a new column (named after the check) of True and False values by applying the pandas Series map method with the check's lambda function. Using map here may be too fancy and may contribute to the code duplication I mentioned, but I wanted to try it out. I then check whether any False values occur in the column, adding the color for each failed check to a flag_strings list. The expanded DataFrame and the list of flags are then returned as part of a namedtuple. generate_results_text takes the query_name and the checked_output_df and iterates over the rows, creating a neat string representation of the query name, the data, and all instances where checks failed. When a False value is found in a row, a label of the form "{check name} condition failed" is created. If multiple labels occur for a row, they are joined, and then an arrow prefix (" <-- ") is added.
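A rough sketch of these two functions follows. The function names come from the PR, but the "value" column, the namedtuple fields, and the exact output format are my assumptions, not the actual implementation:

```python
from collections import namedtuple

import pandas as pd

# Hypothetical namedtuple holding the expanded DataFrame and raised flags
CheckedOutput = namedtuple("CheckedOutput", ["checked_output_df", "flag_strings"])


def run_checks_on_output(output_df, checks):
    """Add a True/False column per check and collect a flag for each failure."""
    checked_df = output_df.copy()
    flag_strings = []
    for check_name, check in checks.items():
        # Map the check's lambda over the column to mark each row pass/fail
        checked_df[check_name] = checked_df["value"].map(check["condition"])
        # Rows listed in rows_to_ignore always count as passing
        checked_df.loc[check.get("rows_to_ignore", []), check_name] = True
        if not checked_df[check_name].all():
            flag_strings.append(check["color"])
    return CheckedOutput(checked_df, flag_strings)


def generate_results_text(query_name, checked_output_df, check_names):
    """Render the query name and data, marking failing rows with ' <-- ' labels."""
    lines = [f"Results for {query_name}:"]
    for _, row in checked_output_df.iterrows():
        labels = [f"{name} condition failed" for name in check_names if not row[name]]
        line = str(row.drop(check_names).to_dict())
        if labels:
            # Join multiple labels, then add the arrow prefix
            line += " <-- " + "; ".join(labels)
        lines.append(line)
    return "\n".join(lines)
```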

3) Some other changes were made to tie all of this together. The code under "Main Program" passes the relevant data from one function to the next and then creates a duplicate-free list of flags that is used as a prefix for the email subject line. Note also that I now simply return the output_df from execute_query_and_write_to_csv rather than re-opening the CSV in later functions, as we were doing before.
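The flag-deduplication and exit-code steps might look something like this (a sketch only; the function names and the exact subject-line format are my invention, not the PR's code):

```python
def build_subject_prefix(all_flag_strings):
    """Hypothetical: build an email subject prefix from flags without duplicates."""
    # dict.fromkeys drops duplicates while preserving first-seen order
    unique_flags = list(dict.fromkeys(all_flag_strings))
    return " ".join(f"[{flag}]" for flag in unique_flags)


def determine_exit_code(all_flag_strings):
    """Hypothetical: RED means scheduled processes should stop, so exit nonzero."""
    return 1 if "RED" in all_flag_strings else 0
```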

4) I also added two tests to ensure that the two functions described in (2) work as expected (raising flags and modifying the text output) when concerning data (values failing the given conditions) are passed through them.