evansd opened this issue 1 year ago
I have compared the same three approaches, but extracting 52 events from the Appointment
table over a two-year period. The breakdown is:
| Query | CPU time | Elapsed time |
| --- | --- | --- |
| Chained Sorts | 8hrs 20m | 2hrs 9m |
| Multi Join | 2hrs 22m | 55m |
| Wide Table | 1hr 4m | 1hr 48m |
The full scripts, SQL and timing breakdowns are available in this directory.
Note that these are extremely unscientific, single-shot measurements, which will be affected by whatever else was running at the time. In particular, note the inversion of CPU time vs elapsed time for "Multi Join" compared with "Wide Table". I think CPU time is probably the more relevant metric here, but I'm going to try re-running these to give a second point of comparison.
So far, the picture seems to be that the improvements available here are non-trivial, but also not game-changing.
| Query | CPU time | Elapsed time |
| --- | --- | --- |
| Chained Sorts | n/a | n/a |
| Multi Join | 2hrs 22m | 37m |
| Wide Table | 1hr 8m | 1hr 42m |
This suggests that the CPU time/Elapsed time inversion was not a complete anomaly. Presumably the difference is I/O time?
By "pseudo-event-level data" we mean data that is naturally represented as having multiple rows per patient, but which we represent using a fixed (but potentially large) number of columns in a one-row-per-patient table. For example, instead of:
We would structure it as:
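As a hedged illustration of the two structures (all column names and values here are invented for the example):

```python
# Hypothetical illustration: the same data represented event-level and
# as a fixed-width, one-row-per-patient ("pseudo-event-level") table.

# Naturally event-level: multiple rows per patient.
event_level = [
    {"patient_id": 1, "date": "2021-01-01", "value": 10},
    {"patient_id": 1, "date": "2021-03-01", "value": 20},
    {"patient_id": 2, "date": "2021-02-01", "value": 30},
]

# Pseudo-event-level: one row per patient, with numbered column pairs up
# to some fixed maximum; slots beyond a patient's event count are None.
pseudo_event_level = [
    {"patient_id": 1,
     "date_1": "2021-01-01", "value_1": 10,
     "date_2": "2021-03-01", "value_2": 20},
    {"patient_id": 2,
     "date_1": "2021-02-01", "value_1": 30,
     "date_2": None, "value_2": None},
]
```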
The work-around that has developed for this (in both Cohort Extractor and Data Builder) is to chain together a series of single-event queries. That is:
`date > first_event_date`
`date > second_event_date`

The logic for doing this can be wrapped up in some Python which dynamically generates the variable definitions, but this approach nevertheless has some serious shortcomings:
While we would eventually like to have first-class support for event-level data, this will require some significant thought and restructuring to implement, so it's worth seeing whether we can support pseudo-event-level data in a more efficient and ergonomic manner.
In pursuit of this I have done some basic performance testing of two alternative forms of SQL query. I'm not looking in detail at the changes which would be needed to ehrQL/the query model/the query engines in order to make this all work, but I have a reasonably worked-out plan in my head should we decide to follow up.
Comparison
I based the test on the study query which caused us to look at this issue now. It attempts to retrieve four columns for each of the 26 most recent ONS_CIS events during a certain time period. I compared three approaches, using some basic Python scripts to generate the repetitive SQL.
**Chained Sorts** The approach described above. This involves creating a new intermediate table for each event number and then JOINing them all at the end. Example here.
**Wide Table** Turning the results of the sort into a single, wide, one-row-per-patient intermediate table using a long list of CASE statements, and then doing a single JOIN at the end. Example here.
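A hypothetical sketch of the Wide Table shape in SQLite (schema and column names invented; the real SQL is linked above): events are numbered per patient with `ROW_NUMBER()`, then collapsed to one row per patient via a list of CASE expressions aggregated with `MAX()`.

```python
import sqlite3

# Toy, invented events table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (patient_id INT, date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, '2021-01-01'), (1, '2021-03-01'), (2, '2021-02-01')],
)

wide = conn.execute("""
    WITH numbered AS (
        SELECT patient_id, date,
               ROW_NUMBER() OVER (
                   PARTITION BY patient_id ORDER BY date
               ) AS n
        FROM events
    )
    -- One CASE expression per event number; in the real query this
    -- list would be generated for all 26 (or 52) event numbers.
    SELECT patient_id,
           MAX(CASE WHEN n = 1 THEN date END) AS date_1,
           MAX(CASE WHEN n = 2 THEN date END) AS date_2
    FROM numbered
    GROUP BY patient_id
    ORDER BY patient_id
""").fetchall()
# wide: [(1, '2021-01-01', '2021-03-01'), (2, '2021-02-01', None)]
```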
**Multiple Joins** Writing the results of the sort into a many-rows-per-patient table and then JOINing repeatedly to this table, but with a different row number condition on each JOIN. Example here.
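A hypothetical sketch of the Multiple Joins shape in SQLite (invented schema; the real SQL is linked above): a single numbered, many-rows-per-patient table is joined once per event number, each join carrying a different row number condition.

```python
import sqlite3

# Toy, invented events table plus a per-patient row-numbered copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (patient_id INT, date TEXT);
    INSERT INTO events VALUES
        (1, '2021-01-01'), (1, '2021-03-01'), (2, '2021-02-01');

    CREATE TABLE numbered AS
        SELECT patient_id, date,
               ROW_NUMBER() OVER (
                   PARTITION BY patient_id ORDER BY date
               ) AS n
        FROM events;
""")

# One LEFT JOIN per event number, differing only in the `n = ...`
# condition; the real query would repeat this for every event number.
rows = conn.execute("""
    SELECT p.patient_id, e1.date AS date_1, e2.date AS date_2
    FROM (SELECT DISTINCT patient_id FROM events) p
    LEFT JOIN numbered e1 ON e1.patient_id = p.patient_id AND e1.n = 1
    LEFT JOIN numbered e2 ON e2.patient_id = p.patient_id AND e2.n = 2
    ORDER BY p.patient_id
""").fetchall()
# rows: [(1, '2021-01-01', '2021-03-01'), (2, '2021-02-01', None)]
```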
I initially intended to look at using the PIVOT function but, as I understand it, it doesn't give us a particularly nice way of retrieving multiple columns from each event.
Here are the headline results:
The full scripts, SQL and timing breakdowns are available in this directory.
You can paste the `.txt` timing files into the handy statisticsparser.com site, which can format them nicely and generate totals.