Handle categoricals in binary output formats

evansd commented 2 years ago

This was a thing we failed to get right with cohortextractor and I think it's important to get it right in Data Builder.

Often datasets will have columns of string type where the strings can only be one of a fixed set of values. We want to encode these in the binary output as categoricals rather than strings because the space saving is enormous. (Admittedly, if we're using compression then the savings in disk space may be fairly minimal, but it's the saving in RAM that's really important here.)

When we attempted this in cohortextractor we tried to build the list of categories on the fly while processing the output and then convert the column to a categorical at the end before writing out the data.

This has some serious disadvantages:

It means we need to buffer the whole output in memory before writing out, instead of being able to stream it to disk in chunks.
It meant that the order of the categories was not stable and would change depending on the order in which we happened to first encounter each category in the output. This was particularly problematic for Stata and meant we had to abandon use of categorical variables altogether there.

So I think we should only use categoricals where we know upfront, before retrieving the data, the full set of possible categories.

Fortunately, for the two most common forms of string variable I think we can do this. The two forms are: the output from a categorised codelist; the result of a set of logical conditions in a case/categorised_as expression.

In the Data Builder query model both of these are implemented using the Case operation, so this gives us a fairly simple rule:

Where a variable is the result of a Case operation, all of whose values are static strings then that variable is a categorical and those strings are its categories.

This does leave a class of string variable that won't be treated as a categorical and that is codes drawn directly from a column of type CTV3, SNOMED-CT etc. The problem here is that, without bundling with databuilder (and keeping up-to-date) the entire set of possible codes for each coding system, we can't determine in advance what the set of required categories will be.

However my sense is that it's relatively rare to attempt to return individual codes like this as opposed to mapping them to categories and returning the categories.

Related databuilder issues:

benbc commented 2 years ago

Other categoricals:

sex
ethnicity?
anything that's a quantile
rural/urban classification
household type

Actually I'll give up trying to enumerate them because I've found too many in the contracts doc.

Can we handle this with the type system? Any categorical column comes annotated with the possible values; any calculation that introduces a categorical (like Case) can calculate the possible values and propagate an appropriate type.

evansd commented 2 years ago

My sense is that a lot of these (sex being the major exception) end up getting passed through Case anyway so I was wondering if we could get away with not doing this. But yes, handling this via the type system would be the ideal way.

I think there are some complexities this will introduce which I was hoping to avoid thinking about. Or at least, to not making solving them a prerequisite for supporting binary outputs. But maybe I'm being overly pessimistic.

benbc commented 2 years ago

Do we need to have an optimal solution for categoricals in order to support binary outputs?

evansd commented 2 years ago

We certainly don't need an optimal solution, but we do need something I think.

The proposal I made above covers (I think/hope) a reasonable proportion of the cases where categoricals would be appropriate. And it also provides an escape hatch for forcing an output type to be categorical if that turns out to make a significant performance/RAM-usage difference for a particular study (by passing it through an otherwise pointless CASE expression).

So my thought was that we could support that from day 1, and then later doing something more sophisticated with the type system to cover the rest of the cases. That would be a breaking change, but probably we'll have a bunch of those that we realise we need to make which we can bundle up together.

wjchulme commented 2 years ago

This was this original issue for cohort extractor https://github.com/opensafely-core/cohort-extractor/issues/312, including this (non-exhaustive) list of categoricals:

STP (stp_code)
MSOA (msoa_code)
region (nuts1_region_name)
ethnicity
A&E attendance primary diagnosis
care home status
sex

wjchulme commented 2 years ago

The two forms are: the output from a categorised codelist; the result of a set of logical conditions in a case/categorised_as expression

I think there's an advantage to considering a third type, which is those that are just straight-up categorical columns in the database that are returned as-is. For example, sex and region.

There's space savings, but another advantage here is the quality-assurance component (which is irrelevant for the first two forms):

Are there new values in the database that weren't in the original set of possible values? We can test that each time there's a new build.
Is the dummy data consistent with the set of possible values? We can test that (although this may not make sense in databuilder's new dummy data model, which I don't really understand yet).
Are there loads of possible levels (eg MSOA, STP) that are a PITA to construct with dummy data? We can provide access to canned distributions.

But this is additional work, and maybe not a priority, in which case:

[Case] provides an escape hatch for forcing an output type to be categorical

is a neat (temporary!) solution.

wjchulme commented 2 years ago

One potential annoyance with 'Case' is that the outputted categorical levels may not in the desired order, because the case-logic made it convenient to construct the levels in the wrong order. So having a way to specify the desired order (eg levels = ["A", "B" "C", "Unknown"]) might be nice.

For example here a case statement is being used to construct a categorical with possible values ["No events", "event type A", "event type B", "event type C"], but the without the subsequent call to fct_rev() (which reverses the factor levels), the levels are in the wrong order, with `"No events" at the end when it should be at the beginning.

evansd commented 2 years ago

Thanks @wjchulme

This was this original issue for cohort extractor https://github.com/opensafely-core/cohort-extractor/issues/312

I think that one was more about sensible (and more convenient) dummy data generation than it was about binary output formats, though I agree the issues are related.

I think there's an advantage to considering a third type, which is those that are just straight-up categorical columns in the database

Yep, this relates to Ben's point as well. I think we definitely want to handle these too, eventually. I was just wondering how far the simple thing will get us.

One potential annoyance with 'Case' is that the outputted categorical levels may not in the desired order, because the case-logic made it convenient to construct the levels in the wrong order

Ah yes, I actually meant to ask about this. I think there are only two reasonable things we can do here: output the categories in the order they are specified in the Case expression; output them in lexical order. Having some additional API for specifying the order adds a complexity to the system that I don't think we can justify.

I initially lent towards using the order of cases, but maybe there's an argument for going with lexical order. If you care about the order and the categories you're using don't naturally sort how you'd like then you can always add a numeric prefix e.g.

01 Unknown
02 Some Category
03 Some Other Category
...

Any thoughts on that?

wjchulme commented 2 years ago

Thanks, Dave. I think order of cases is best.

For most epi-type studies where you'll be running the data through R/stata scripts, it doesn't really matter what the data builder spits out because you can relabel or reorder the levels however you like afterwards.

But if you're using something like measures and other reusable actions for the whole pipeline, where there's no clear opportunity to intervene to reorder/relabel the levels, then it's important to be able to use the data definition to specify the level order, without potentially unwanted numeric prefixes. So in the same what that you could use case to force things to be categorical, you could also use case to force the levels to be in the right order, which only works if case orders by case order. That might mean having two case statements, one to define the cases and one to reorder them, which is a bit inelegant, but at least it gives you control.

opensafely-core / ehrql

Handle categoricals in binary output formats #631