opensafely-actions / cohort-report

Cohort-report generates a report for each variable in an input file
MIT License

Redact after transform #40

Closed iaindillingham closed 3 years ago

iaindillingham commented 3 years ago

At present, we redact columns in the input table before generating summary statistics or bar charts/histograms. This is problematic for columns that are drawn from continuous distributions (i.e. is_numeric_dtype(s.dtype) == True). We redact a column by counting the occurrences of each distinct value in it, and a continuous column will have many distinct values, each with a low count. To put that another way, it's likely that a real-world input table containing these columns would be redacted.
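
To illustrate the problem, here is a minimal sketch; it is not cohort-report's actual code. The threshold of 5 and the synthetic age column are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

THRESHOLD = 5  # hypothetical redaction threshold, not taken from cohort-report

# A synthetic continuous column: 1,000 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
ages = pd.Series(rng.normal(loc=50, scale=10, size=1_000), name="age")

# The column is numeric, so it is the kind of column the issue describes.
assert is_numeric_dtype(ages.dtype)

# Count how often each distinct value occurs, as a count-based rule would.
counts = ages.value_counts()

# Share of distinct values whose count is at or below the threshold.
low = counts[counts <= THRESHOLD]
share_low = len(low) / len(counts)
print(f"{share_low:.0%} of distinct values have a count of {THRESHOLD} or fewer")
```

Because the draws are effectively all distinct, every value has a low count, so a rule based on these counts leaves nothing to report.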

Instead, we should first transform columns in the input table that are drawn from continuous distributions (by computing their summary statistics, or the data for bar charts/histograms) and then redact this transformed data. Doing so is likely to involve reworking series_report and series_graph to accept the transformed data. (Note that the existing variable called transformed_series, which is typed, is the result of a different transform.)
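
A rough sketch of transform-then-redact for a histogram follows. The names bin_counts and redact_low_counts, the threshold of 5, and the use of pd.cut for binning are all hypothetical; series_report and series_graph may well end up looking different.

```python
import pandas as pd

THRESHOLD = 5  # hypothetical redaction threshold, not taken from cohort-report


def bin_counts(series, bins):
    """Transform: bin a continuous column and count the rows in each bin."""
    binned = pd.cut(series, bins=bins)
    # sort=False keeps the bins in their natural order, including empty bins.
    return binned.value_counts(sort=False)


def redact_low_counts(counts, threshold=THRESHOLD):
    """Redact: replace counts of `threshold` or fewer with NaN."""
    return counts.where(counts > threshold)


# Made-up data: 20 values in (0, 1], 2 values in (1, 2], 30 values in (2, 3].
values = pd.Series([0.5] * 20 + [1.5] * 2 + [2.5] * 30, name="measurement")

counts = bin_counts(values, bins=[0, 1, 2, 3])
redacted = redact_low_counts(counts)
print(redacted)
```

Here only the middle bin is redacted, whereas counting distinct values in the raw column would not generalise to a genuinely continuous measurement.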

Thanks @robinyjpark for spotting this issue.

iaindillingham commented 3 years ago

It would make sense to do this after #37 is merged and before #27 is started.

iaindillingham commented 3 years ago

At present, for each column in the input table, cohort-report generates descriptive statistics and a bar chart or histogram.

It would be good to compute the descriptive statistics one by one, rather than by calling series.describe. By doing so, we could clarify which are "safe" statistics (i.e. those that don't need to be redacted) and guard against changes to Pandas (e.g. the introduction of the maximum or minimum value to the result of calling series.describe).
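
As a sketch of what computing the statistics one by one could look like: the function name describe_explicitly and the particular choice of statistics are assumptions, and which statistics actually count as "safe" still needs to be decided. For reference, for a numeric series, series.describe returns count, mean, std, min, the 25%, 50% and 75% percentiles, and max.

```python
import pandas as pd


def describe_explicitly(series):
    """Compute a fixed, explicit set of statistics for a numeric column.

    The set below is illustrative only; it deliberately leaves out the
    minimum and maximum, which reveal individual values in the column.
    """
    return pd.Series(
        {
            "count": series.count(),
            "mean": series.mean(),
            "std": series.std(),
            "median": series.median(),
        },
        name=series.name,
    )


values = pd.Series([1.0, 2.0, 3.0, 4.0], name="measurement")
stats = describe_explicitly(values)
print(stats)
```

Because each statistic is named explicitly, the output stays the same even if the result of series.describe changes in a later release.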