opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/

Investigate alternative query strategies for pseudo-event-level data #989

Open evansd opened 1 year ago

evansd commented 1 year ago

By "pseudo-event-level data" we mean data that is naturally represented as having multiple rows per-patient but which we represent using a fixed (but potentially large) number of columns in a one-row-per-patient table. For example, instead of:

| patient_id | date       | value |
|------------|------------|-------|
| 1          | 2020-01-01 | 10    |
| 1          | 2020-02-01 | 11    |
| 1          | 2020-03-01 | 12    |
| 2          | 2020-01-01 | 20    |
| 2          | 2020-02-01 | 21    |

We would structure it as:

| patient_id | date_1     | value_1 | date_2     | value_2 | date_3     | value_3 |
|------------|------------|---------|------------|---------|------------|---------|
| 1          | 2020-01-01 | 10      | 2020-02-01 | 11      | 2020-03-01 | 12      |
| 2          | 2020-01-01 | 20      | 2020-02-01 | 21      | NULL       | NULL    |

The work-around that has developed for this (in both Cohort Extractor and Data Builder) is to chain together a series of single-event queries: extract the first matching event per patient, then the first event occurring after that one, and so on for as many events as are needed.

The logic for doing this can be wrapped up in some Python which dynamically generates the variable definitions, but this approach nevertheless has some serious shortcomings.

While we would eventually like to have first-class support for event-level data, this will require significant thought and restructuring to implement, so it's worth seeing whether we can support pseudo-event-level data in a more efficient and ergonomic manner in the meantime.

In pursuit of this I have done some basic performance testing of two alternative forms of SQL query. I'm not looking in detail at the changes that would be needed to ehrQL, the query model, or the query engines in order to make this all work; but I have a reasonably well worked-out plan in my head should we decide to follow up.

Comparison

I based the test on the study query which caused us to look at this issue now. It attempts to retrieve four columns for each of the 26 most recent ONS_CIS events during a certain time period.

I compared three approaches, using some basic Python scripts to generate the repetitive SQL.

Chained Sorts The approach described above. This involves creating a new intermediate table for each event number and then JOINing them all together at the end. Example here.
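The linked example is T-SQL generated against the real schema; as a rough sketch of the shape (table and column names invented, and only two events extracted for brevity), the chained approach builds one intermediate result per event number, each anchored on the previous event's date, and joins them at the end. Illustrated here in SQLite:

```python
import sqlite3

# Toy event-level table matching the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (patient_id INT, date TEXT, value INT);
INSERT INTO events VALUES
  (1, '2020-01-01', 10), (1, '2020-02-01', 11), (1, '2020-03-01', 12),
  (2, '2020-01-01', 20), (2, '2020-02-01', 21);
""")

# Each CTE finds the earliest event *after* the previous event's date
# (assumes dates are unique per patient), then everything is joined
# back together at the end.
rows = con.execute("""
WITH e1 AS (
    SELECT patient_id, MIN(date) AS date FROM events GROUP BY patient_id
),
e2 AS (
    SELECT ev.patient_id, MIN(ev.date) AS date
    FROM events ev
    JOIN e1 ON ev.patient_id = e1.patient_id AND ev.date > e1.date
    GROUP BY ev.patient_id
)
SELECT p.patient_id,
       e1.date AS date_1, v1.value AS value_1,
       e2.date AS date_2, v2.value AS value_2
FROM (SELECT DISTINCT patient_id FROM events) p
LEFT JOIN e1 ON e1.patient_id = p.patient_id
LEFT JOIN events v1 ON v1.patient_id = p.patient_id AND v1.date = e1.date
LEFT JOIN e2 ON e2.patient_id = p.patient_id
LEFT JOIN events v2 ON v2.patient_id = p.patient_id AND v2.date = e2.date
ORDER BY p.patient_id
""").fetchall()
```

The key property (and the key cost) is that the definition of event N depends on the result for event N−1, so the chain of intermediate results grows with the number of events extracted.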

Wide Table Turning the results of the sort into a single, wide, one-row-per-patient intermediate table using a long list of CASE statements, and then doing a single JOIN at the end. Example here.
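Again the linked example is the real generated SQL; the general shape (with invented table and column names) is a ROW_NUMBER sort followed by one CASE expression per output column, collapsed with GROUP BY. A SQLite sketch:

```python
import sqlite3

# Same toy event-level table as in the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (patient_id INT, date TEXT, value INT);
INSERT INTO events VALUES
  (1, '2020-01-01', 10), (1, '2020-02-01', 11), (1, '2020-03-01', 12),
  (2, '2020-01-01', 20), (2, '2020-02-01', 21);
""")

# Number each patient's events once, then pivot to one wide row per
# patient: MAX() picks the single non-NULL value each CASE produces.
rows = con.execute("""
WITH numbered AS (
    SELECT patient_id, date, value,
           ROW_NUMBER() OVER (
               PARTITION BY patient_id ORDER BY date
           ) AS n
    FROM events
)
SELECT patient_id,
       MAX(CASE WHEN n = 1 THEN date END)  AS date_1,
       MAX(CASE WHEN n = 1 THEN value END) AS value_1,
       MAX(CASE WHEN n = 2 THEN date END)  AS date_2,
       MAX(CASE WHEN n = 2 THEN value END) AS value_2,
       MAX(CASE WHEN n = 3 THEN date END)  AS date_3,
       MAX(CASE WHEN n = 3 THEN value END) AS value_3
FROM numbered
GROUP BY patient_id
ORDER BY patient_id
""").fetchall()
```

Note that the sort is performed once, however many event columns are extracted; only the list of CASE expressions grows.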

Multiple Joins Writing the results of the sort into a many-rows-per-patient table and then JOINing repeatedly to this table, but with a different row number condition on each JOIN. Example here.
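As with the other two, the linked example is the real SQL; the shape of this approach (invented names again) is one row-numbered intermediate result, joined to once per event with a different `n = …` condition each time. A SQLite sketch:

```python
import sqlite3

# Same toy event-level table as in the examples above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (patient_id INT, date TEXT, value INT);
INSERT INTO events VALUES
  (1, '2020-01-01', 10), (1, '2020-02-01', 11), (1, '2020-03-01', 12),
  (2, '2020-01-01', 20), (2, '2020-02-01', 21);
""")

# One many-rows-per-patient numbered result, LEFT JOINed repeatedly
# with a different row number condition on each JOIN.
rows = con.execute("""
WITH numbered AS (
    SELECT patient_id, date, value,
           ROW_NUMBER() OVER (
               PARTITION BY patient_id ORDER BY date
           ) AS n
    FROM events
)
SELECT p.patient_id,
       e1.date AS date_1, e1.value AS value_1,
       e2.date AS date_2, e2.value AS value_2,
       e3.date AS date_3, e3.value AS value_3
FROM (SELECT DISTINCT patient_id FROM events) p
LEFT JOIN numbered e1 ON e1.patient_id = p.patient_id AND e1.n = 1
LEFT JOIN numbered e2 ON e2.patient_id = p.patient_id AND e2.n = 2
LEFT JOIN numbered e3 ON e3.patient_id = p.patient_id AND e3.n = 3
ORDER BY p.patient_id
""").fetchall()
```

Like the Wide Table approach, this sorts only once, but it trades the long CASE list for one extra join per event number.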

I initially intended to look at using the PIVOT function but, as I understand it, it doesn't give us a particularly nice way of retrieving multiple columns from each event.

Here are the headline results:

| Query          | CPU (seconds) | Elapsed (seconds) |
|----------------|---------------|-------------------|
| Chained Sorts  | 284.1         | 36.4              |
| Multiple Joins | 45.6          | 12.1              |
| Wide Table     | 46.7          | 7.6               |

The full scripts, SQL and timing breakdowns are available in this directory.

You can paste the .txt timing files into the handy statisticsparser.com site which can format them nicely and generate totals.

evansd commented 1 year ago

Further comparisons

I have compared the same three approaches, but extracting 52 events from the Appointment table over a two-year period. The breakdown is:

| Query          | CPU time | Elapsed time |
|----------------|----------|--------------|
| Chained Sorts  | 8hrs 20m | 2hrs 9m      |
| Multiple Joins | 2hrs 22m | 55m          |
| Wide Table     | 1hr 4m   | 1hr 48m      |

The full scripts, SQL and timing breakdowns are available in this directory.

Note that these are extremely unscientific, single-shot measurements which will be affected by whatever else was running at the time. In particular, note the inversion of CPU time vs elapsed time for "Multiple Joins" compared with "Wide Table": "Multiple Joins" uses more CPU but finishes sooner. I think CPU time is probably the more relevant metric here, but I'm going to try re-running these to give a second point of comparison.

So far, the picture seems to be that the improvements available here are non-trivial, but also not game-changing.

evansd commented 1 year ago

Second runs on the above

| Query          | CPU time | Elapsed time |
|----------------|----------|--------------|
| Chained Sorts  | n/a      | n/a          |
| Multiple Joins | 2hrs 22m | 37m          |
| Wide Table     | 1hr 8m   | 1hr 42m      |

This suggests that the CPU time/Elapsed time inversion was not a complete anomaly. Presumably the difference is I/O time?