Closed iaindillingham closed 3 years ago
It would make sense to do this after #37 is merged and before #27 is started.
At present, for each column in the input table, cohort-report generates descriptive statistics by calling `series.describe`.
It would be good to call the descriptive statistics one-by-one, rather than by calling `series.describe`. By doing so we could clarify which are "safe" statistics (i.e. don't need to be redacted) and would guard against changes to Pandas (e.g. the introduction of the maximum or minimum value to the result of calling `series.describe`).
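As a rough sketch of what calling the statistics one-by-one could look like: the allow-list below is an assumption for illustration (which statistics are actually "safe" is still to be decided), and `SAFE_STATISTICS` and `describe_safely` are hypothetical names, not existing cohort-report code.

```python
import pandas as pd

# Hypothetical allow-list of statistics assumed to be "safe" to release.
# The minimum and maximum are deliberately left out, because each one
# reports a single value from the input table.
SAFE_STATISTICS = {
    "count": lambda s: s.count(),
    "mean": lambda s: s.mean(),
    "std": lambda s: s.std(),
    "25%": lambda s: s.quantile(0.25),
    "50%": lambda s: s.quantile(0.50),
    "75%": lambda s: s.quantile(0.75),
}


def describe_safely(series: pd.Series) -> pd.Series:
    """Compute each allow-listed statistic explicitly, one at a time."""
    values = {name: func(series) for name, func in SAFE_STATISTICS.items()}
    return pd.Series(values, name=series.name)


if __name__ == "__main__":
    print(describe_safely(pd.Series([1, 2, 3, 4, 5], name="age")))
```

Because the result is built from an explicit list, a future Pandas release that changes what `series.describe` returns would not change what ends up in the report.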
At present, we redact columns in the input table before generating summary statistics or bar charts/histograms. This is problematic for columns in the input table that are drawn from continuous distributions (i.e. `is_numeric_dtype(s.dtype) == True`) because we redact these columns by counting the number of distinct values in each and, in these cases, there will be many distinct values with low counts. To put that another way, it's likely that a real-world input table containing these columns would be redacted.

Instead, we should first transform columns in the input table that are drawn from continuous distributions (by computing their summary statistics, or the data for bar charts/histograms) and then redact this transformed data. Doing so is likely going to involve reworking `series_report` and `series_graph` to accept the transformed data. (Note that there's a variable called `transformed_series`, which is a different transform.)

Thanks @robinyjpark for spotting this issue.
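For reference, a minimal sketch of the transform-then-redact order described above. Everything here is an assumption for illustration: the threshold of 5, the rule of masking counts at or below it, and the helper names `histogram_counts`, `redact_counts` and `summarise_then_redact` are hypothetical and don't reflect the current `series_report`/`series_graph` code.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

THRESHOLD = 5  # hypothetical redaction threshold


def histogram_counts(series: pd.Series, bins: int = 10) -> pd.Series:
    """Transform a numeric column into per-bin counts."""
    counts, edges = np.histogram(series.dropna(), bins=bins)
    # NumPy bins are left-closed (the last bin also includes its right edge).
    index = pd.IntervalIndex.from_breaks(edges, closed="left")
    return pd.Series(counts, index=index, name=series.name)


def redact_counts(counts: pd.Series, threshold: int = THRESHOLD) -> pd.Series:
    """Replace counts at or below the threshold with NaN."""
    return counts.where(counts > threshold)


def summarise_then_redact(series: pd.Series, bins: int = 10) -> pd.Series:
    """Transform first, then redact the transformed data."""
    if is_numeric_dtype(series.dtype):
        counts = histogram_counts(series, bins=bins)
    else:
        counts = series.value_counts()
    return redact_counts(counts)


if __name__ == "__main__":
    s = pd.Series(np.arange(100.0), name="bmi")
    # Counting distinct values first: every count is 1, so all are redacted.
    print(redact_counts(s.value_counts()).isna().all())
    # Binning first: each of the 4 bins holds 25 rows, so none are redacted.
    print(summarise_then_redact(s, bins=4))
```

The point of the example is the ordering: redaction is applied to the binned counts rather than to the raw distinct values, so a continuous column is no longer redacted wholesale.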