opensafely-actions / dataset-report

dataset-report generates a report for each dataset in an input directory

Summarize categorical columns #23

Open iaindillingham opened 2 years ago

iaindillingham commented 2 years ago

From @andrewscolm. Thanks, Colm 🙂

If we implemented #22, we would struggle to summarize counts for each category, because categorical columns differ in how many categories they have. However, we could summarize the number of unique values and the number of missing values. As @wjchulme says about the number of unique values:

Usually because if it's 1, you know something has probably gone wrong. But just in general if it's lower/higher than expected

Do we also need to summarize counts for each category?
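For illustration, the two per-column summaries mentioned above (number of unique values, number of missing values) could look something like the sketch below. It is plain Python rather than anything taken from dataset-report, the function name `summarize_categorical` is hypothetical, and it assumes missing values arrive as `None` or an empty string.

```python
from typing import Optional, Sequence


def summarize_categorical(values: Sequence[Optional[str]]) -> dict:
    """Summarize one categorical column as a single row.

    Assumption: a value is "missing" if it is None or the empty string.
    """
    present = [v for v in values if v is not None and v != ""]
    return {
        # Number of distinct non-missing categories in the column
        "n_unique": len(set(present)),
        # Number of entries treated as missing
        "n_missing": len(values) - len(present),
    }


# Hypothetical column with two categories and two missing entries
sex = ["F", "M", "F", None, "", "M", "F"]
print(summarize_categorical(sex))
```

Because the result has the same two fields whatever the number of categories, it fits a one-row-per-column summary, and an `n_unique` of 1 stands out immediately.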

andrewscolm commented 2 years ago

Thanks @iaindillingham, would you be able to implement something like the 'top_counts' column in Will's example? It summarizes the counts for a maximum of 4 categories. This would be really helpful to see if there have been any major mistakes.

If that isn't possible then, as Will stated, the number of unique values is still a useful insight.
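As a rough sketch of what a 'top_counts'-style summary might involve: the exact format of the column in Will's example isn't shown in this thread, so the `category: count` output format, the function name, and the alphabetical tie-breaking below are all assumptions. Only the cap of 4 categories comes from the comment above.

```python
from collections import Counter


def top_counts(values, max_categories=4):
    """Return the most frequent categories as a short 'category: count' string.

    Assumptions: None marks a missing value and is excluded; ties are
    broken alphabetically so the output is deterministic.
    """
    counts = Counter(v for v in values if v is not None)
    # Sort by descending count, then by category name
    ranked = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    return ", ".join(f"{category}: {n}" for category, n in ranked[:max_categories])


# Hypothetical column with five categories; only the top four are reported
region = (
    ["London"] * 5
    + ["East"] * 3
    + ["North West"] * 2
    + ["South West"] * 2
    + ["Wales"]
    + [None]
)
print(top_counts(region))
```

The counts in a summary like this would need the same redaction as any other output before release.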

wjchulme commented 2 years ago

Not sure if this was clear before, but for categoricals a table of counts, a la cohort-report, is often still really useful.

So both a single-row-per-variable format to have an overview of the entire dataset (split by variable type) and count tabulation for relevant categorical variables would be useful. Could also simplify things by just tabulating all variables with fewer than ~20 unique values, to avoid, e.g., STPs or MSOAs being tabulated and to ensure categorical-as-int variables are still included. These tables would live in a separate document.

Obviously with redaction!
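A minimal sketch of the "tabulate anything with fewer than ~20 unique values" idea, in plain Python rather than dataset-report's own code. The cutoff of 20 comes from the comment above; the redaction threshold of 5, the `[REDACTED]` marker, and the function name are hypothetical placeholders, not an agreed disclosure rule.

```python
from collections import Counter

MAX_UNIQUE = 20        # from "fewer than ~20 unique values" above
REDACT_AT_OR_BELOW = 5  # hypothetical threshold, not an agreed rule


def tabulate_low_cardinality(dataset):
    """Tabulate every column with fewer than MAX_UNIQUE distinct values.

    `dataset` maps column names to lists of values. The cutoff is applied
    to the values themselves, not to a declared type, so integer-coded
    categoricals are included and high-cardinality codes are skipped.
    """
    tables = {}
    for name, values in dataset.items():
        counts = Counter(v for v in values if v is not None)
        if len(counts) >= MAX_UNIQUE:
            continue  # e.g. STP or MSOA codes: too many categories
        tables[name] = {
            # Replace small counts with a marker instead of the number
            category: (n if n > REDACT_AT_OR_BELOW else "[REDACTED]")
            for category, n in counts.items()
        }
    return tables


# Hypothetical dataset: one string column, one high-cardinality column,
# and one categorical stored as integers
dataset = {
    "sex": ["F"] * 8 + ["M"] * 6 + ["I"] * 2,
    "msoa": [f"E{i:08d}" for i in range(30)],
    "imd": [1] * 7 + [2] * 9,
}
for name, table in tabulate_low_cardinality(dataset).items():
    print(name, table)
```

Suppressing small cells on their own may not be sufficient if a redacted count can be recovered from a known total, so this only shows where redaction would slot in.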