opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/

Investigate alternative query strategies for pseudo-event-level data #989

Open evansd opened 1 year ago

evansd commented 1 year ago

By "pseudo-event-level data" we mean data that is naturally represented as having multiple rows per-patient but which we represent using a fixed (but potentially large) number of columns in a one-row-per-patient table. For example, instead of:

| patient_id | date       | value |
|------------|------------|-------|
| 1          | 2020-01-01 | 10    |
| 1          | 2020-02-01 | 11    |
| 1          | 2020-03-01 | 12    |
| 2          | 2020-01-01 | 20    |
| 2          | 2020-02-01 | 21    |

We would structure it as:

| patient_id | date_1     | value_1 | date_2     | value_2 | date_3     | value_3 |
|------------|------------|---------|------------|---------|------------|---------|
| 1          | 2020-01-01 | 10      | 2020-02-01 | 11      | 2020-03-01 | 12      |
| 2          | 2020-01-01 | 20      | 2020-02-01 | 21      | NULL       | NULL    |

The work-around that has developed for this (in both Cohort Extractor and Data Builder) is to chain together a series of single-event queries: extract the first matching event per patient, then the first event occurring after that one, and so on for as many events as are needed.

The logic for doing this can be wrapped up in some Python which dynamically generates the variable definitions, but this approach nevertheless has some serious shortcomings.

While we would eventually like to have first-class support for event-level data, this will require significant thought and restructuring to implement, so it's worth seeing whether we can support pseudo-event-level data in a more efficient and ergonomic manner in the meantime.

In pursuit of this I have done some basic performance testing of two alternative forms of SQL query. I'm not looking in detail at the changes that would be needed to ehrQL, the query model, or the query engines in order to make this all work; but I have a reasonably well worked-out plan in my head should we decide to follow up.

Comparison

I based the test on the study query which caused us to look at this issue now. It attempts to retrieve four columns for each of the 26 most recent ONS_CIS events during a certain time period.

I compared three approaches, using some basic Python scripts to generate the repetitive SQL.

Chained Sorts The approach described above. This involves creating a new intermediate table for each event number and then JOINing them all together at the end. Example here.
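The linked example is T-SQL generated against the real schema; as a rough sketch of the shape (table and column names invented, and only two events extracted for brevity), the chained approach builds one intermediate result per event number, each anchored on the previous event's date, and joins them at the end. Illustrated here in SQLite:

```python
import sqlite3

# Toy event-level table matching the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (patient_id INT, date TEXT, value INT);
INSERT INTO events VALUES
  (1, '2020-01-01', 10), (1, '2020-02-01', 11), (1, '2020-03-01', 12),
  (2, '2020-01-01', 20), (2, '2020-02-01', 21);
""")

# Each CTE finds the earliest event *after* the previous event's date
# (assumes dates are unique per patient), then everything is joined
# back together at the end.
rows = con.execute("""
WITH e1 AS (
    SELECT patient_id, MIN(date) AS date FROM events GROUP BY patient_id
),
e2 AS (
    SELECT ev.patient_id, MIN(ev.date) AS date
    FROM events ev
    JOIN e1 ON ev.patient_id = e1.patient_id AND ev.date > e1.date
    GROUP BY ev.patient_id
)
SELECT p.patient_id,
       e1.date AS date_1, v1.value AS value_1,
       e2.date AS date_2, v2.value AS value_2
FROM (SELECT DISTINCT patient_id FROM events) p
LEFT JOIN e1 ON e1.patient_id = p.patient_id
LEFT JOIN events v1 ON v1.patient_id = p.patient_id AND v1.date = e1.date
LEFT JOIN e2 ON e2.patient_id = p.patient_id
LEFT JOIN events v2 ON v2.patient_id = p.patient_id AND v2.date = e2.date
ORDER BY p.patient_id
""").fetchall()
```

The key property (and the key cost) is that the definition of event N depends on the result for event N−1, so the chain of intermediate results grows with the number of events extracted.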

Wide Table Turning the results of the sort into a single, wide, one-row-per-patient intermediate table using a long list of CASE statements, and then doing a single JOIN at the end. Example here.
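Again the linked example is the real generated SQL; the general shape (with invented table and column names) is a ROW_NUMBER sort followed by one CASE expression per output column, collapsed with GROUP BY. A SQLite sketch:

```python
import sqlite3

# Same toy event-level table as in the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (patient_id INT, date TEXT, value INT);
INSERT INTO events VALUES
  (1, '2020-01-01', 10), (1, '2020-02-01', 11), (1, '2020-03-01', 12),
  (2, '2020-01-01', 20), (2, '2020-02-01', 21);
""")

# Number each patient's events once, then pivot to one wide row per
# patient: MAX() picks the single non-NULL value each CASE produces.
rows = con.execute("""
WITH numbered AS (
    SELECT patient_id, date, value,
           ROW_NUMBER() OVER (
               PARTITION BY patient_id ORDER BY date
           ) AS n
    FROM events
)
SELECT patient_id,
       MAX(CASE WHEN n = 1 THEN date END)  AS date_1,
       MAX(CASE WHEN n = 1 THEN value END) AS value_1,
       MAX(CASE WHEN n = 2 THEN date END)  AS date_2,
       MAX(CASE WHEN n = 2 THEN value END) AS value_2,
       MAX(CASE WHEN n = 3 THEN date END)  AS date_3,
       MAX(CASE WHEN n = 3 THEN value END) AS value_3
FROM numbered
GROUP BY patient_id
ORDER BY patient_id
""").fetchall()
```

Note that the sort is performed once, however many event columns are extracted; only the list of CASE expressions grows.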

Multiple Joins Writing the results of the sort into a many-rows-per-patient table and then JOINing repeatedly to this table, but with a different row number condition on each JOIN. Example here.
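As with the other two, the linked example is the real SQL; the shape of this approach (invented names again) is one row-numbered intermediate result, joined to once per event with a different `n = …` condition each time. A SQLite sketch:

```python
import sqlite3

# Same toy event-level table as in the examples above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (patient_id INT, date TEXT, value INT);
INSERT INTO events VALUES
  (1, '2020-01-01', 10), (1, '2020-02-01', 11), (1, '2020-03-01', 12),
  (2, '2020-01-01', 20), (2, '2020-02-01', 21);
""")

# One many-rows-per-patient numbered result, LEFT JOINed repeatedly
# with a different row number condition on each JOIN.
rows = con.execute("""
WITH numbered AS (
    SELECT patient_id, date, value,
           ROW_NUMBER() OVER (
               PARTITION BY patient_id ORDER BY date
           ) AS n
    FROM events
)
SELECT p.patient_id,
       e1.date AS date_1, e1.value AS value_1,
       e2.date AS date_2, e2.value AS value_2,
       e3.date AS date_3, e3.value AS value_3
FROM (SELECT DISTINCT patient_id FROM events) p
LEFT JOIN numbered e1 ON e1.patient_id = p.patient_id AND e1.n = 1
LEFT JOIN numbered e2 ON e2.patient_id = p.patient_id AND e2.n = 2
LEFT JOIN numbered e3 ON e3.patient_id = p.patient_id AND e3.n = 3
ORDER BY p.patient_id
""").fetchall()
```

Like the Wide Table approach, this sorts only once, but it trades the long CASE list for one extra join per event number.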

I initially intended to look at using the PIVOT function but, as I understand it, it doesn't give us a particularly nice way of retrieving multiple columns from each event.

Here are the headline results:

| Query          | CPU (seconds) | Elapsed (seconds) |
|----------------|---------------|-------------------|
| Chained Sorts  | 284.1         | 36.4              |
| Multiple Joins | 45.6          | 12.1              |
| Wide Table     | 46.7          | 7.6               |

The full scripts, SQL and timing breakdowns are available in this directory.

You can paste the .txt timing files into the handy statisticsparser.com site which can format them nicely and generate totals.

evansd commented 1 year ago

Further comparisons

I have compared the same three approaches, but extracting 52 events from the Appointment table over a two-year period. The breakdown is:

| Query          | CPU time | Elapsed time |
|----------------|----------|--------------|
| Chained Sorts  | 8hrs 20m | 2hrs 9m      |
| Multiple Joins | 2hrs 22m | 55m          |
| Wide Table     | 1hr 4m   | 1hr 48m      |

The full scripts, SQL and timing breakdowns are available in this directory.

Note that these are extremely unscientific, single-shot measurements which will be affected by whatever else was running at the time. In particular, note the inversion of CPU time vs elapsed time for "Multiple Joins" compared with "Wide Table": "Multiple Joins" uses more CPU but finishes sooner. I think CPU time is probably the more relevant metric here, but I'm going to try re-running these to give a second point of comparison.

So far, the picture seems to be that the improvements available here are non-trivial, but also not game-changing.

evansd commented 1 year ago

Second runs on the above

| Query          | CPU time | Elapsed time |
|----------------|----------|--------------|
| Chained Sorts  | n/a      | n/a          |
| Multiple Joins | 2hrs 22m | 37m          |
| Wide Table     | 1hr 8m   | 1hr 42m      |

This suggests that the CPU time/Elapsed time inversion was not a complete anomaly. Presumably the difference is I/O time?