TL;DR
Currently, completing an LFPSE query takes anywhere between 15-20 min. This delay seems related to the way we query the Events table and the many left joins required to bring all analysis tables together.
Explanation
In lfpse.R, the event_table object is created as follows:
SELECT "q01".*
FROM (
SELECT
"Events".*,
MIN("SubmissionDate") OVER (PARTITION BY "Reference") AS "reported_date",
MAX("Revision") OVER (PARTITION BY "Reference") AS "max_revision"
FROM "analysis"."Events"
) "q01"
WHERE ("Revision" = "max_revision")
What happened above? We worked out the latest revision and reported date for every reference, then recorded what those were in every row of the Events table, and then filtered the table to keep rows for which revision was the latest. The OVER (PARTITION BY) is not necessarily fast and the delay is exacerbated by the final, nested SQL query.
What's the alternative? Figuring out latest revision and reported date from Events can be done via a group by statement, resulting in a table with 3 variables: Reference, Revision (i.e., latest revision), and reported_date. We then use that table as the base for subsequent left joins of all analysis tables (including Events).
TL;DR Currently, completing an LFPSE query takes anywhere between 15-20 min. This delay seems related to the way we query the
Events
table and the many left joins required to bring all analysis tables together.Explanation In lfpse.R, the
event_table
object is created as follows:Which dbplyr will translate to:
What happened above? We worked out the latest revision and reported date for every reference, then recorded what those were in every row of the
Events
table, and then filtered the table to keep rows for which revision was the latest. TheOVER (PARTITION BY)
is not necessarily fast and the delay is exacerbated by the final, nested SQL query.What's the alternative? Figuring out latest revision and reported date from
Events
can be done via a group by statement, resulting in a table with 3 variables: Reference, Revision (i.e., latest revision), and reported_date. We then use that table as the base for subsequent left joins of all analysis tables (includingEvents
).