opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
7 stars 3 forks source link

Handle categoricals in binary output formats #631

Closed evansd closed 2 years ago

evansd commented 2 years ago

This was a thing we failed to get right with cohortextractor and I think it's important to get it right in Data Builder.

Often datasets will have columns of string type where the strings can only be one of a fixed set of values. We want to encode these in the binary output as categoricals rather than strings because the space saving is enormous. (Admittedly, if we're using compression then the savings in disk space may be fairly minimal, but it's the saving in RAM that's really important here.)

When we attempted this in cohortextractor we tried to build the list of categories on the fly while processing the output and then convert the column to a categorical at the end before writing out the data.

This has some serious disadvantages:

So I think we should only use categoricals where we know upfront, before retrieving the data, the full set of possible categories.

Fortunately, for the two most common forms of string variable I think we can do this. The two forms are: the output from a categorised codelist; the result of a set of logical conditions in a case/categorised_as expression.

In the Data Builder query model both of these are implemented using the Case operation, so this gives us a fairly simple rule:

This does leave a class of string variable that won't be treated as a categorical and that is codes drawn directly from a column of type CTV3, SNOMED-CT etc. The problem here is that, without bundling with databuilder (and keeping up-to-date) the entire set of possible codes for each coding system, we can't determine in advance what the set of required categories will be.

However my sense is that it's relatively rare to attempt to return individual codes like this as opposed to mapping them to categories and returning the categories.

Related databuilder issues:

benbc commented 2 years ago

Other categoricals:

Actually I'll give up trying to enumerate them because I've found too many in the contracts doc.

Can we handle this with the type system? Any categorical column comes annotated with the possible values; any calculation that introduces a categorical (like Case) can calculate the possible values and propagate an appropriate type.

evansd commented 2 years ago

My sense is that a lot of these (sex being the major exception) end up getting passed through Case anyway so I was wondering if we could get away with not doing this. But yes, handling this via the type system would be the ideal way.

I think there are some complexities this will introduce which I was hoping to avoid thinking about. Or at least, to not making solving them a prerequisite for supporting binary outputs. But maybe I'm being overly pessimistic.

benbc commented 2 years ago

Do we need to have an optimal solution for categoricals in order to support binary outputs?

evansd commented 2 years ago

We certainly don't need an optimal solution, but we do need something I think.

The proposal I made above covers (I think/hope) a reasonable proportion of the cases where categoricals would be appropriate. And it also provides an escape hatch for forcing an output type to be categorical if that turns out to make a significant performance/RAM-usage difference for a particular study (by passing it through an otherwise pointless CASE expression).

So my thought was that we could support that from day 1, and then later doing something more sophisticated with the type system to cover the rest of the cases. That would be a breaking change, but probably we'll have a bunch of those that we realise we need to make which we can bundle up together.

wjchulme commented 2 years ago

This was this original issue for cohort extractor https://github.com/opensafely-core/cohort-extractor/issues/312, including this (non-exhaustive) list of categoricals:

wjchulme commented 2 years ago

The two forms are: the output from a categorised codelist; the result of a set of logical conditions in a case/categorised_as expression

I think there's an advantage to considering a third type, which is those that are just straight-up categorical columns in the database that are returned as-is. For example, sex and region.

There's space savings, but another advantage here is the quality-assurance component (which is irrelevant for the first two forms):

But this is additional work, and maybe not a priority, in which case:

[Case] provides an escape hatch for forcing an output type to be categorical

is a neat (temporary!) solution.

wjchulme commented 2 years ago

One potential annoyance with 'Case' is that the outputted categorical levels may not in the desired order, because the case-logic made it convenient to construct the levels in the wrong order. So having a way to specify the desired order (eg levels = ["A", "B" "C", "Unknown"]) might be nice.

For example here a case statement is being used to construct a categorical with possible values ["No events", "event type A", "event type B", "event type C"], but the without the subsequent call to fct_rev() (which reverses the factor levels), the levels are in the wrong order, with `"No events" at the end when it should be at the beginning.

evansd commented 2 years ago

Thanks @wjchulme

This was this original issue for cohort extractor https://github.com/opensafely-core/cohort-extractor/issues/312

I think that one was more about sensible (and more convenient) dummy data generation than it was about binary output formats, though I agree the issues are related.


I think there's an advantage to considering a third type, which is those that are just straight-up categorical columns in the database

Yep, this relates to Ben's point as well. I think we definitely want to handle these too, eventually. I was just wondering how far the simple thing will get us.


One potential annoyance with 'Case' is that the outputted categorical levels may not in the desired order, because the case-logic made it convenient to construct the levels in the wrong order

Ah yes, I actually meant to ask about this. I think there are only two reasonable things we can do here: output the categories in the order they are specified in the Case expression; output them in lexical order. Having some additional API for specifying the order adds a complexity to the system that I don't think we can justify.

I initially lent towards using the order of cases, but maybe there's an argument for going with lexical order. If you care about the order and the categories you're using don't naturally sort how you'd like then you can always add a numeric prefix e.g.

01 Unknown
02 Some Category
03 Some Other Category
...

Any thoughts on that?

wjchulme commented 2 years ago

Thanks, Dave. I think order of cases is best.

For most epi-type studies where you'll be running the data through R/stata scripts, it doesn't really matter what the data builder spits out because you can relabel or reorder the levels however you like afterwards.

But if you're using something like measures and other reusable actions for the whole pipeline, where there's no clear opportunity to intervene to reorder/relabel the levels, then it's important to be able to use the data definition to specify the level order, without potentially unwanted numeric prefixes. So in the same what that you could use case to force things to be categorical, you could also use case to force the levels to be in the right order, which only works if case orders by case order. That might mean having two case statements, one to define the cases and one to reorder them, which is a bit inelegant, but at least it gives you control.