nhs-england-patient-safety / patient-safety-data-request-template

Template for extracting data from the national patient safety systems in England.
MIT License
0 stars 0 forks source link

Delays in LFPSE queries #7

Closed znogbes closed 1 day ago

znogbes commented 3 weeks ago

TL;DR Currently, completing an LFPSE query takes anywhere between 15-20 min. This delay seems related to the way we query the Events table and the many left joins required to bring all analysis tables together.

Explanation In lfpse.R, the event_table object is created as follows:

event_table<- tbl(con_lfpse, in_schema("analysis","Events")) %>%
  group_by(Reference) |>
  mutate(reported_date = min(SubmissionDate),
         max_revision = max(Revision)) %>%
  ungroup() %>%
  filter(Revision == max_revision)

Which dbplyr will translate to:

SELECT "q01".*
FROM (
  SELECT
    "Events".*,
    MIN("SubmissionDate") OVER (PARTITION BY "Reference") AS "reported_date",
    MAX("Revision") OVER (PARTITION BY "Reference") AS "max_revision"
  FROM "analysis"."Events"
) "q01"
WHERE ("Revision" = "max_revision")

What happened above? We worked out the latest revision and reported date for every reference, then recorded what those were in every row of the Events table, and then filtered the table to keep rows for which revision was the latest. The OVER (PARTITION BY) is not necessarily fast and the delay is exacerbated by the final, nested SQL query.

What's the alternative? Figuring out latest revision and reported date from Events can be done via a group by statement, resulting in a table with 3 variables: Reference, Revision (i.e., latest revision), and reported_date. We then use that table as the base for subsequent left joins of all analysis tables (including Events).

znogbes commented 1 day ago

Addressed in https://github.com/nhs-england-patient-safety/patient-safety-data-request-template/pull/9