Closed iaindillingham closed 2 years ago
I've mentioned "safe" descriptive statistics, above. I'm planning to use the definitions in:
M. Brandt et al., ‘Guidelines for the checking of output based on microdata research’, Jan. 2010, Accessed: Oct. 28, 2021. [Online]. Available: https://uwe-repository.worktribe.com/output/983615
You'll see that actually, most of the descriptive statistics are unsafe (maximum, minimum, percentiles, means, etc.). However, I'm planning to follow the overall/specific rules of thumb to make output checking easier.
Why Brandt et al.? Because Felix Ritchie, who runs the output checking course, helped develop The Five Safes framework; this framework incorporates Statistical Disclosure Control (SDC); and Brandt et al. is cited as the "formal" definition of SDC.
This looks good so far to me, and addresses a few things that are long overdue. One thing that does occur to me is this PR does a lot:
These are all things that need to happen but I am wondering if small PRs might be better each tied to an issue. I am not sure how practical that will be esp given the change to matplotlib but I think might be easier to understand when looking at this in future.
You mention in the nice-to-haves, a cleaner API. What did you have in mind?
The PR does a lot, but it does it with a little code; indeed, if I removed what's no longer used, then I think this PR would be a net reduction in code. Setting that aside, it's hard to split this PR into multiple PRs because:
my_series.describe()
and letting Pandas decide what is returned, we will call my_series.mean()
when the "rules of thumb" in Brandt et al. are satisfied. This could be a separate PR, but it still falls under the "redact before transform" umbrella.~bar
and hbar
. This could be a separate PR, but I don't think much would be gained by omitting the h
.WRT the cleaner API, you can see this emerging in cohortreport.processing
: we have main redact
and plot
functions, which delegate to private helper functions such as _get_unit_mask
and _plot_hist
. The main and private helper functions can be tested independently. We could also make them available on a series by using register_series_accessor
. For example:
# Rather than calling like this...
plot(my_series)
# We could call like this...
my_series.opensafely.plot()
Let's consider the descriptive statistics. Previously, we called Series.describe()
:
From the Series.describe()
documentation:
The output will vary depending on what is provided.
Let's investigate the output, given the input:
It's unlikely that a numeric series will be passed to this function, because of the nature of the redaction function (see above). If a numeric series was passed to this function, then it would return the number of values; the mean and standard deviation; the minimum value; the maximum value; and the 25th, 50th, and 75th percentiles. These are "unsafe" statistics according to §2.1 in Brandt et al.
If a series of strings or timestamps was passed to this function, then it would return the number of values; the number of unique values; the most common value and that value's frequency; and, for timestamps, the earliest value and the latest value. These are "unsafe" statistics according to §2.1 in Brandt et al.
I can think of several next steps:
Series.describe()
, flagging that it produces unsafe statistics in the HTML report and the docstring.I asked about which next step to take in this Slack thread. The consensus view was that because it's unlikely a researcher would release the HTML report from L4, it's acceptable to continue to call Series.describe()
, with flags in the HTML report and the docstring.
It's worth summarizing the differences WRT redaction, descriptive statistics, and bar charts/histograms between "previously" (v2.0.2
) and "now" (this PR).
Previously:
Now:
Notice that: the approach to redaction is now consistent with Brandt et al.; descriptive statistics are always generated.
I've said that this PR fixes #50. You might want some evidence for that claim 🙂 So, let's compare the size of the HTML document before and after these changes. Specifically:
@inglesp very kindly talked through this PR with me today. It's now ready, Peter. I have:
If you're happy with it, then I will squash the "feature" commits (from a2623da) and merge.
Problem
Previously, columns in the patient records table were redacted before they were transformed. In the following snippet,
df
is the patient records table;suppress_low_numbers
is the redaction function;series_report
is a statistical transformation (it calculates descriptive statistics); andseries_graph
is a visual transformation (it plots either a bar chart or a histogram).https://github.com/opensafely-actions/cohort-report/blob/d9237b52874cad29e8dc6884c1e7b1718cec188b/cohortreport/report.py#L79-L87
The redaction function is essentially a group by/count (
series.value_counts()
) of the values in a column. If the count of any group in the column -- including the "null" group -- is less than a threshold, then the column is redacted. If not, then the column is returned. Consequently, columns with many, low count groups will be redacted; columns that are drawn from continuous distributions often have this characteristic and so often are redacted. Clearly, this is undesirable.Solution
Now, columns in the patient records table are transformed before they are redacted. More specifically:
~TODO the statistical transformation, where descriptive statistics were calculated, is now more explicit. Previously,
series.describe()
was used, which calculates different descriptive statistics depending on the data type (dtype
) of the series. Now, "safe" descriptive statistics are used; that is, descriptive statistics that don't require redaction.~the visual transformation has been split into group, redact, and plot functions. A column that is drawn from a continuous distribution will be grouped (binned) and if a group is less than a threshold, the group (not the column) will be redacted. (This design decision was made through discussion with @robinyjpark and @alexwalkerepi.)
The above has also allowed us to:
weights
argument toAxes.hist
. Remember, too, that a histogram isn't a bar chart.Some additional "nice to haves" are:
Fixes #40 Fixes #50