opensafely-actions / cohort-report

Cohort-report generates a report for each variable in an input file
MIT License
0 stars 0 forks source link

Redact after transform #46

Closed iaindillingham closed 2 years ago

iaindillingham commented 2 years ago

@inglesp very kindly talked through this PR with me today. It's now ready, Peter. I have:

If you're happy with it, then I will squash the "feature" commits (from a2623da) and merge.


Problem

Previously, columns in the patient records table were redacted before they were transformed. In the following snippet, df is the patient records table; suppress_low_numbers is the redaction function; series_report is a statistical transformation (it calculates descriptive statistics); and series_graph is a visual transformation (it plots either a bar chart or a histogram).

https://github.com/opensafely-actions/cohort-report/blob/d9237b52874cad29e8dc6884c1e7b1718cec188b/cohortreport/report.py#L79-L87

The redaction function is essentially a group by/count (series.value_counts()) of the values in a column. If the count of any group in the column -- including the "null" group -- is less than a threshold, then the column is redacted. If not, then the column is returned. Consequently, columns with many, low count groups will be redacted; columns that are drawn from continuous distributions often have this characteristic and so often are redacted. Clearly, this is undesirable.

Solution

Now, columns in the patient records table are transformed before they are redacted. More specifically:

The above has also allowed us to:

Some additional "nice to haves" are:

Fixes #40 Fixes #50

iaindillingham commented 2 years ago

I've mentioned "safe" descriptive statistics, above. I'm planning to use the definitions in:

M. Brandt et al., ‘Guidelines for the checking of output based on microdata research’, Jan. 2010, Accessed: Oct. 28, 2021. [Online]. Available: https://uwe-repository.worktribe.com/output/983615

You'll see that actually, most of the descriptive statistics are unsafe (maximum, minimum, percentiles, means, etc.). However, I'm planning to follow the overall/specific rules of thumb to make output checking easier.

Why Brandt et al.? Because Felix Ritchie, who runs the output checking course, helped develop The Five Safes framework; this framework incorporates Statistical Disclosure Control (SDC); and Brandt et al. is cited as the "formal" definition of SDC.

CarolineMorton commented 2 years ago

This looks good so far to me, and addresses a few things that are long overdue. One thing that does occur to me is this PR does a lot:

  1. Remove Plotly for Matplotlib
  2. Add different descriptive statistics
  3. refactor group, redact, and plot functions and their order
  4. Switch axes on graphs

These are all things that need to happen but I am wondering if small PRs might be better each tied to an issue. I am not sure how practical that will be esp given the change to matplotlib but I think might be easier to understand when looking at this in future.

You mention in the nice-to-haves, a cleaner API. What did you have in mind?

iaindillingham commented 2 years ago

The PR does a lot, but it does it with a little code; indeed, if I removed what's no longer used, then I think this PR would be a net reduction in code. Setting that aside, it's hard to split this PR into multiple PRs because:

WRT the cleaner API, you can see this emerging in cohortreport.processing: we have main redact and plot functions, which delegate to private helper functions such as _get_unit_mask and _plot_hist. The main and private helper functions can be tested independently. We could also make them available on a series by using register_series_accessor. For example:

# Rather than calling like this...
plot(my_series)
# We could call like this...
my_series.opensafely.plot()
iaindillingham commented 2 years ago

Let's consider the descriptive statistics. Previously, we called Series.describe():

https://github.com/opensafely-actions/cohort-report/blob/d9237b52874cad29e8dc6884c1e7b1718cec188b/cohortreport/series_report.py#L30

From the Series.describe() documentation:

The output will vary depending on what is provided.

Let's investigate the output, given the input:

I can think of several next steps:


I asked about which next step to take in this Slack thread. The consensus view was that because it's unlikely a researcher would release the HTML report from L4, it's acceptable to continue to call Series.describe(), with flags in the HTML report and the docstring.


It's worth summarizing the differences WRT redaction, descriptive statistics, and bar charts/histograms between "previously" (v2.0.2) and "now" (this PR).

Previously:

Now:

Notice that: the approach to redaction is now consistent with Brandt et al.; descriptive statistics are always generated.

iaindillingham commented 2 years ago

I've said that this PR fixes #50. You might want some evidence for that claim 🙂 So, let's compare the size of the HTML document before and after these changes. Specifically: