Redact after transform - Githubissues

iaindillingham commented 2 years ago

@inglesp very kindly talked through this PR with me today. It's now ready, Peter. I have:

Responded to your comments (see below)
Added an updated screenshot to README.md
Updated this description
Removed the cruft 🧹

If you're happy with it, then I will squash the "feature" commits (from a2623da) and merge.

Problem

Previously, columns in the patient records table were redacted before they were transformed. In the following snippet, df is the patient records table; suppress_low_numbers is the redaction function; series_report is a statistical transformation (it calculates descriptive statistics); and series_graph is a visual transformation (it plots either a bar chart or a histogram).

https://github.com/opensafely-actions/cohort-report/blob/d9237b52874cad29e8dc6884c1e7b1718cec188b/cohortreport/report.py#L79-L87

The redaction function is essentially a group by/count (series.value_counts()) of the values in a column. If the count of any group in the column, including the "null" group, is less than a threshold, then the whole column is redacted; otherwise, the column is returned unchanged. Consequently, columns with many low-count groups are redacted. Columns drawn from continuous distributions often have this characteristic, so they are often redacted. Clearly, this is undesirable.
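To make the failure mode concrete, here is a minimal sketch of a column-level redaction function of the kind described above. It is not the code in report.py: the name redact_column, the threshold of 5, and the example data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def redact_column(series: pd.Series, threshold: int = 5) -> pd.Series:
    """Return the series unchanged, or an empty series if any group is too small.

    Hypothetical stand-in for suppress_low_numbers; the threshold is an assumption.
    """
    # dropna=False means the "null" group is counted as well.
    counts = series.value_counts(dropna=False)
    if (counts < threshold).any():
        return pd.Series(dtype=series.dtype, name=series.name)
    return series


# A discrete column with a few well-populated groups survives...
sex = pd.Series(["F"] * 50 + ["M"] * 50, name="sex")
# ...but a continuous column has roughly one group per value, so it is redacted.
rng = np.random.default_rng(0)
bmi = pd.Series(rng.normal(25, 4, size=100), name="bmi")

sex_result = redact_column(sex)
bmi_result = redact_column(bmi)
```

Here sex_result keeps all of its rows, whereas bmi_result is empty, even though a histogram of bmi with sensible bins would not need to disclose any small group.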

Solution

Now, columns in the patient records table are transformed before they are redacted. More specifically:

~TODO the statistical transformation, where descriptive statistics were calculated, is now more explicit. Previously, series.describe() was used, which calculates different descriptive statistics depending on the data type (dtype) of the series. Now, "safe" descriptive statistics are used; that is, descriptive statistics that don't require redaction.~
the visual transformation has been split into group, redact, and plot functions. A column that is drawn from a continuous distribution will be grouped (binned), and if the count for a group is less than a threshold, then that group (not the whole column) will be redacted. (This design decision was made through discussion with @robinyjpark and @alexwalkerepi.)
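The group and redact steps might look something like the sketch below. The function names, the use of numpy.histogram for binning, the number of bins, and the threshold are assumptions for illustration, not the code in this PR.

```python
import numpy as np
import pandas as pd


def group_continuous(series: pd.Series, bins: int = 10) -> pd.Series:
    """Bin a continuous column and return the count in each bin (hypothetical)."""
    counts, edges = np.histogram(series.dropna().to_numpy(), bins=bins)
    # Index each count by the left edge of its bin.
    return pd.Series(counts, index=edges[:-1], name=series.name)


def redact_groups(counts: pd.Series, threshold: int = 10) -> pd.Series:
    """Replace counts below the threshold with NaN; other groups are kept."""
    return counts.where(counts >= threshold)


values = pd.Series([1.0] * 30 + [2.0] * 30 + [9.0] * 3, name="example")
grouped = group_continuous(values, bins=4)
redacted = redact_groups(grouped)
```

With these values, the first bin holds 60 units and is kept; the remaining bins hold fewer than 10 units and are redacted. The column as a whole is still reported.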

The above has also allowed us to:

replace Plotly with Matplotlib (#7). Plotly does not appear to support histograms with pre-binned data; Matplotlib does, with the weights argument to Axes.hist. Remember, too, that a histogram isn't a bar chart.
switch the axes on the bar charts (#26).
~TODO include additional descriptive information (#27)~
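As a rough illustration of the weights argument, the sketch below plots a histogram from pre-binned counts. The data, the number of bins, and the choice of the Agg backend (so that no display is needed) are assumptions; the plotting code in this PR may differ.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, so no display is needed.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1_000)

# Pre-bin the data; redaction of small bins would happen at this point.
counts, edges = np.histogram(data, bins=20)

fig, ax = plt.subplots()
# Pass one "observation" per bin (its left edge), weighted by that bin's count.
n, _, patches = ax.hist(edges[:-1], bins=edges, weights=counts)
ax.set_xlabel("value")
ax.set_ylabel("count")
```

The heights returned in n match the pre-binned counts, so the chart is drawn from the frequency table rather than from the patient-level values. Calling fig.savefig with a .png filename would then write the chart as a PNG image.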

Some additional "nice to haves" are:

a table and a chart for each column in the patient records table, even when a group is redacted; it is easier to interrogate the HTML report on L4
saving charts as PNG images; it is easier to interrogate charts on L4
a cleaner API
less code (468 insertions, 1310 deletions)

Fixes #40 Fixes #50

iaindillingham commented 2 years ago

I've mentioned "safe" descriptive statistics, above. I'm planning to use the definitions in:

M. Brandt et al., ‘Guidelines for the checking of output based on microdata research’, Jan. 2010, Accessed: Oct. 28, 2021. [Online]. Available: https://uwe-repository.worktribe.com/output/983615

You'll see that actually, most of the descriptive statistics are unsafe (maximum, minimum, percentiles, means, etc.). However, I'm planning to follow the overall/specific rules of thumb to make output checking easier.

Why Brandt et al.? Because Felix Ritchie, who runs the output checking course, helped develop The Five Safes framework; this framework incorporates Statistical Disclosure Control (SDC); and Brandt et al. is cited as the "formal" definition of SDC.

CarolineMorton commented 2 years ago

This looks good so far to me, and addresses a few things that are long overdue. One thing that does occur to me is this PR does a lot:

Remove Plotly for Matplotlib
Add different descriptive statistics
refactor group, redact, and plot functions and their order
Switch axes on graphs

These are all things that need to happen, but I am wondering if smaller PRs, each tied to an issue, might be better. I am not sure how practical that will be, especially given the change to Matplotlib, but I think it might be easier to understand when looking at this in future.

You mention in the nice-to-haves, a cleaner API. What did you have in mind?

iaindillingham commented 2 years ago

The PR does a lot, but it does it with a little code; indeed, if I removed what's no longer used, then I think this PR would be a net reduction in code. Setting that aside, it's hard to split this PR into multiple PRs because:

Plotly/Matplotlib: Plotly doesn't support histograms with pre-binned data. We need to pre-bin the data, so we can redact it. So, we can either plot a histogram as a bar chart or replace Plotly with Matplotlib. Histograms are emphatically not bar charts 🙂
~Descriptive statistics: They're not different, really; they're just called explicitly and redacted when necessary. So, rather than calling my_series.describe() and letting Pandas decide what is returned, we will call my_series.mean() when the "rules of thumb" in Brandt et al. are satisfied. This could be a separate PR, but it still falls under the "redact after transform" umbrella.~
Refactoring: We can't redact after transform without splitting up some existing functions.
Switch axes: This is the difference between calling bar and barh. This could be a separate PR, but I don't think much would be gained by omitting the h.

WRT the cleaner API, you can see this emerging in cohortreport.processing: we have main redact and plot functions, which delegate to private helper functions such as _get_unit_mask and _plot_hist. The main and private helper functions can be tested independently. We could also make them available on a series by using register_series_accessor. For example:

# Rather than calling like this...
plot(my_series)
# We could call like this...
my_series.opensafely.plot()
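A minimal sketch of how such an accessor could be registered is below. The accessor class, the stand-in redact function, and the threshold are hypothetical; only pd.api.extensions.register_series_accessor is the real Pandas API.

```python
import pandas as pd


def redact(series: pd.Series, threshold: int = 10) -> pd.Series:
    """Stand-in for a main redact function: count groups, blank out small ones."""
    counts = series.value_counts(dropna=False)
    return counts.where(counts >= threshold)


@pd.api.extensions.register_series_accessor("opensafely")
class OpenSafelyAccessor:
    """Hypothetical accessor; not part of cohort-report."""

    def __init__(self, series: pd.Series) -> None:
        self._series = series

    def redact(self, threshold: int = 10) -> pd.Series:
        # Delegate to the module-level function, so both call styles agree.
        return redact(self._series, threshold=threshold)


my_series = pd.Series(["a"] * 12 + ["b"] * 3)
via_function = redact(my_series)
via_accessor = my_series.opensafely.redact()
```

Because the accessor only delegates, the module-level functions remain the single place where the logic lives and can still be tested directly.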

iaindillingham commented 2 years ago

Let's consider the descriptive statistics. Previously, we called Series.describe():

https://github.com/opensafely-actions/cohort-report/blob/d9237b52874cad29e8dc6884c1e7b1718cec188b/cohortreport/series_report.py#L30

From the Series.describe() documentation:

The output will vary depending on what is provided.

Let's investigate the output, given the input:

It's unlikely that a numeric series will be passed to this function, because of the nature of the redaction function (see above). If a numeric series were passed to this function, then it would return the number of values; the mean and standard deviation; the minimum value; the maximum value; and the 25th, 50th, and 75th percentiles. These are "unsafe" statistics according to §2.1 in Brandt et al.
If a series of strings or timestamps were passed to this function, then it would return the number of values; the number of unique values; the most common value and that value's frequency; and, for timestamps, the earliest value and the latest value. These are "unsafe" statistics according to §2.1 in Brandt et al.
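The difference is easy to see with two small, made-up series. This sketch covers only the numeric and string cases; the output for timestamps depends on the Pandas version, so it is left out.

```python
import pandas as pd

numeric = pd.Series([1.0, 2.0, 3.0, 4.0])
strings = pd.Series(["a", "a", "b", "c"])

numeric_stats = numeric.describe()
string_stats = strings.describe()

# The statistics returned are the index labels of each result.
print(list(numeric_stats.index))
print(list(string_stats.index))
```

The numeric result includes min, max, and the percentiles; the string result includes top and freq. Both therefore expose values that can come from a single unit.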

I can think of several next steps:

Continue to call Series.describe(), flagging that it produces unsafe statistics in the HTML report and the docstring.
Remove the calculation of descriptive statistics in this PR; introduce the calculation in a new PR. Between this PR and the new PR, cohort-report would generate "safe" bar charts and histograms. They're "safe" because:
- they're generated from frequency tables that satisfy the "rules of thumb" in §2.3 in Brandt et al.;
- they're bitmap images (PNGs) rather than vector images (SVGs).
As above, but in this PR.
Replace the calculation of descriptive statistics with the underlying frequency table in this PR; introduce the calculation in a new PR. (Where frequency table is defined rather loosely as a table of the number of units by group, and so can be used to plot a bar chart or a histogram.)

I asked about which next step to take in this Slack thread. The consensus view was that because it's unlikely a researcher would release the HTML report from L4, it's acceptable to continue to call Series.describe(), with flags in the HTML report and the docstring.

It's worth summarizing the differences WRT redaction, descriptive statistics, and bar charts/histograms between "previously" (v2.0.2) and "now" (this PR).

Previously:

Continuous columns (columns that were drawn from continuous distributions) would almost always have been redacted; no descriptive statistics; no histogram.
Discrete columns would have been grouped by/counted.
- If a count was less than a threshold, then the column was redacted; no descriptive statistics; no bar chart.
- If a count was not less than a threshold, then descriptive statistics and a bar chart were generated.

Now:

All columns are grouped: a binning operation is used for continuous columns; a group by/count operation is used for discrete columns.
- If a count is less than 10 or if a count represents greater than 90% of the total, then the group is redacted. Either a bar chart or a histogram is generated; if all groups are redacted, then this will be empty.
Descriptive statistics are generated for all columns.

Notice that: the approach to redaction is now consistent with Brandt et al.; descriptive statistics are always generated.
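The redaction rule described under "Now" can be sketched as a boolean mask over a frequency table. The function name and the example counts are assumptions, and the helper in cohortreport.processing may be implemented differently; the limits of 10 and 90% are the ones stated above.

```python
import pandas as pd


def redaction_mask(
    counts: pd.Series, minimum: int = 10, max_share: float = 0.9
) -> pd.Series:
    """Return True for each group that should be redacted (hypothetical sketch)."""
    too_small = counts < minimum
    too_dominant = counts > max_share * counts.sum()
    return too_small | too_dominant


counts = pd.Series({"A": 950, "B": 40, "C": 5, "D": 5})
mask = redaction_mask(counts)
# Series.mask replaces the flagged groups with NaN and keeps the rest.
redacted = counts.mask(mask)
```

In this example the total is 1,000, so group A is redacted for representing more than 90% of the total, groups C and D are redacted for being smaller than 10, and only group B is kept.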

iaindillingham commented 2 years ago

I've said that this PR fixes #50. You might want some evidence for that claim 🙂 So, let's compare the size of the HTML document before and after these changes. Specifically:

618d190 6.3M
a1e5912 7.7K, with three PNG files of 7.7K, 6.9K, and 6.1K

opensafely-actions / cohort-report

Redact after transform #46