opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
6 stars 3 forks source link

Include "archived" versions of secondary care data tables (APCS, EC, OPA) #2038

Closed evansd closed 2 months ago

evansd commented 2 months ago

The secondary care data tables (APCS, EC, OPA and their various associated tables) are getting too large to be transferred in full each time.

The current proposed solution is to split these into a "live" dataset covering 2022 onwards and an "archive" dataset covering dates prior to 2022. The live dataset would be regularly updated and the archived one would stay static once created.

Obviously the same problem is going to reoccur a few years down the line at which point we can either agree to shift the live/archive cutoff point or possibly we'll have some totally different solution.

The specific tables in question are:

APCS
APCS_Cost
APCS_Der

EC
EC_Comorbidities
EC_AlcoholDrugInvolvement
EC_Cost
EC_Diagnosis
EC_Investigation
EC_PatientMentalHealth
EC_Treatment

OPA
OPA_Cost
OPA_Diag
OPA_Proc

For each of these, there will be a table with an identical schema and the suffix _ARCHIVED containing data prior to 2022.

We intend to keep the user-facing ehrQL tables the same and use a UNION ALL query in the backend definition to combine results for the live and archive tables. It's possible that eventually the performance of this approach will become unacceptable, but we'll tackle that problem as and when it arises.

Implementation notes

At present, the archived tables exist in production but are empty. This means that we can deploy code which reads from both live and archive tables and then TPP can start populating the archive tables.

An important consideration is that the data in the live and archive tables will overlap. That is, the live tables should only contain data from 2022 onwards, but the archive tables will also contain some data beyond 2022. Naively combining the two tables would lead to duplicates. So we will need to pick a cutoff date and filter each set of tables appropriately before UNIONing.

One further wrinkle is that these tables have more than one date column so it's not totally clear which date column is the best one to partition on. However, I don't think it really matters as long as we're consistent.

For the backend table definitions, we want to replace queries of the form:

SELECT foo FROM OPA

With queries of the form:

SELECT foo FROM OPA
WHERE OPA.Date_Column >= "20220101"
UNION ALL
SELECT foo FROM OPA_ARCHIVED
WHERE OPA_ARCHIVED.Date_Column < "20220101"

We can use the @QueryTable.from_function decorator to reduce duplication in the SQL definitions. For inspiration see: https://github.com/opensafely-core/ehrql/blob/90f7eeafe9d24ff0de223153ff4b9005043867e6/ehrql/backends/tpp.py#L582-L607

Relevant Slack threads

https://bennettoxford.slack.com/archives/C010SJ89SA3/p1716280499463089 https://bennettoxford.slack.com/archives/C069YDR4NCA/p1716208559335039