Mark which resources are loaded for which patients (i.e. "completion tracking")

This comes from a study need:

Sometimes loading data from the EHR can take a long time and/or can happen in fits and starts. You may take weeks to fully finish loading a set of cohorts.
During that time, studies (including the core__ tables) would probably want to ignore patients that don't have the resources they care about loaded.
i.e. studies want to be able to know whether the Conditions table is accurate for patient X or not - and exclude the patient if not.

My initial thought is to have the ETL keep a metadata table around, marking which resources are "finished" at the Group level, along with which patients belong to which Groups. That way a study could ask if patient X has Conditions yet.

Brainstorming for that approach:

UX for ETL:
- --auto-mark (looks for log file adjacent to input files, or from URL if ETL is doing export)
- --mark GROUP_ID (alternative option if user wants to override)
- If no mark detected or given, all this logic below will be skipped
- Not in love with "mark"... --mark-complete --mark-finished --complete --finish...
ETLing a resource group:
- writes to a table like etl_complete
- with columns (GROUP_ID, RESOURCE_NAME, ETL_DATE, NEWEST_LAST_UPDATED_DATE, OLDEST_LAST_UPDATED_DATE)
- ETL will add new row when it successfully finishes a run
- Open question: should I update a single unique group_id/resource_name row instead of appending to table? Might make querying less messy? But at cost of some potentially useful log records.
ETLing ~patients~ encounters:
- When uploading ~patients~ encounters, ETL will also write all row IDs to a table like ~etl_patient_groups~ etl_encounter_groups
- with columns (~PATIENT_ID~ ENCOUNTER_ID, GROUP_ID)
- Non-unique, as resources can be in multiple groups

Library can tell all resources we have uploaded at least once for a given encounter by doing something like:

SELECT DISTINCT etl_complete.resource_name
FROM etl_encounter_groups
INNER JOIN etl_complete
ON etl_complete.group_id = etl_encounter_groups.group_id
WHERE etl_encounter_groups.encounter_id = 'xxx'
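To make the idea concrete, here is a minimal sketch that builds both tables in an in-memory SQLite database and runs the query above. The column types, the sample rows, and the `loaded_resources` helper are assumptions for illustration only; the real ETL would write to its own output format, not SQLite.

```python
import sqlite3

# Hypothetical schema, following the columns brainstormed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_complete (
        group_id TEXT, resource_name TEXT, etl_date TEXT,
        newest_last_updated_date TEXT, oldest_last_updated_date TEXT
    );
    CREATE TABLE etl_encounter_groups (encounter_id TEXT, group_id TEXT);
""")
conn.executemany(
    "INSERT INTO etl_complete VALUES (?, ?, ?, ?, ?)",
    [
        ("group-a", "Encounter", "2024-01-02", "2024-01-01", "2020-01-01"),
        ("group-a", "Condition", "2024-01-03", "2024-01-01", "2020-01-01"),
        # Append-only variant: a second Condition run adds a second row.
        ("group-a", "Condition", "2024-02-03", "2024-02-01", "2024-01-01"),
        ("group-b", "Observation", "2024-01-04", "2024-01-01", "2020-01-01"),
    ],
)
# Non-unique on purpose: enc-1 belongs to two groups.
conn.executemany(
    "INSERT INTO etl_encounter_groups VALUES (?, ?)",
    [("enc-1", "group-a"), ("enc-1", "group-b"), ("enc-2", "group-b")],
)


def loaded_resources(conn, encounter_id):
    """List the resources uploaded at least once for any group holding this encounter."""
    rows = conn.execute(
        """
        SELECT DISTINCT etl_complete.resource_name
        FROM etl_encounter_groups
        INNER JOIN etl_complete
        ON etl_complete.group_id = etl_encounter_groups.group_id
        WHERE etl_encounter_groups.encounter_id = ?
        """,
        (encounter_id,),
    )
    return sorted(row[0] for row in rows)


print(loaded_resources(conn, "enc-1"))
print(loaded_resources(conn, "enc-2"))
```

Note that the `DISTINCT` is what hides the duplicate Condition rows, so the append-only design does not make this particular query messier.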

Assumptions of this approach:
- Exports are full / not-sliced - i.e. if fed a pile of Conditions, the ETL can assume that's all the conditions available at the time (i.e. not sliced by something like severity=mild). Slicing by date is not really supported either, but if we include the bounds of meta.lastUpdated as suggested above, slicing by that field would be fine.
- Users would be able to keep the logs around or manually provide the group name.
- Group exports are a suitable way to cluster users. (I guess that's just a restatement of the above two assumptions)

Further thinking about this.

tl;dr: Let's track Encounters as our primary completeness kernel-of-truth (rather than patients, as the description above talks about). Studies can ignore any and all Encounter-linked data if it's not loaded yet.

There are two use cases I can think of:

1. A researcher running a study wants to know "If I run the study now, is it going to be the most complete view of the data we have available to us?" - i.e. "Is now the best time to run the study, or should I wait for an ongoing data ingestion process to finish?"
2. An engineer doing data ingestion does not want to cause studies to create misleading data while the ingestion is in-flight. Flipped around: a researcher running a study wants to feel confident that an ongoing ingestion process will not provide misleading data in the meantime. (e.g. no Conditions for an Encounter because they haven't been loaded in yet)

Those are closely related, but a little different. The first is about incompleteness at a broad scale leading to a diminished ability to do meaningful analysis. The second is about incompleteness at a small scale leading to inaccurate analysis.

Solving for the first (incompleteness aka "Is there an ongoing ingestion process?"):

It's hard to solve programmatically unless we added a global "dirty" flag. Which... might be reasonable, but really only useful for this one use case and adds complexity (how should it be managed? and it will definitely get stale)
You could answer this question with a little bit of manual effort with the proposed etl_complete table above in the description - query for the resources you want and a date range and see if you've got all the Conditions loaded up to June of this year, for example.
Or you could just ask the engineers doing the ingestion: "you done yet?" - honestly, this seems the easiest and most natural answer to this problem
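The manual check against etl_complete could look something like the sketch below. It assumes the hypothetical column names from the description, ISO-formatted date strings (so string comparison matches date order), and a made-up `groups_behind` helper. It can only report on groups that have at least one row for the resource; a group that was never loaded simply won't show up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_complete (group_id TEXT, resource_name TEXT, etl_date TEXT, "
    "newest_last_updated_date TEXT, oldest_last_updated_date TEXT)"
)
conn.executemany(
    "INSERT INTO etl_complete VALUES (?, ?, ?, ?, ?)",
    [
        ("group-a", "Condition", "2024-04-02", "2024-03-31", "2020-01-01"),
        ("group-a", "Condition", "2024-07-02", "2024-06-30", "2024-04-01"),
        ("group-b", "Condition", "2024-04-20", "2024-04-15", "2020-01-01"),
    ],
)


def coverage(conn, resource_name):
    """Map each group to the newest meta.lastUpdated date loaded so far for a resource."""
    rows = conn.execute(
        "SELECT group_id, MAX(newest_last_updated_date) FROM etl_complete "
        "WHERE resource_name = ? GROUP BY group_id",
        (resource_name,),
    )
    return dict(rows.fetchall())


def groups_behind(conn, resource_name, needed_through):
    """Groups whose loaded data for this resource stops before the date we need."""
    newest_by_group = coverage(conn, resource_name)
    return sorted(g for g, newest in newest_by_group.items() if newest < needed_through)


# "Do we have all the Conditions loaded up to June of this year?"
print(groups_behind(conn, "Condition", "2024-06-01"))
```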

Solving for the second (inaccuracy aka "How to stop the data from lying to me during an ingestion process?"):

We want to solve not just for the initial patient load, but also updates of that data (like, now we're loading in just the last month's worth of updates)
For rows that are "unlinked" (think: Device), those are just loose piles of data and any updates (read: new bucket of data being poured on top) can just come in when they come in. If you don't have all the latest updates in the pile yet, that's an incompleteness issue ("do we have it all yet?"), but not an inaccurate outcome.
But for rows that are "linked" (think: Condition.encounter) we want to avoid considering either side of the link until both are available. Or we risk an inaccurate view of the data.
We could mark a patient/group combo as incomplete once we start an ingestion. But:
- that requires some tooling knowledge of ingestion like "ok I'm starting a data update" and "ok I'm done"
- it's disruptive to remove a patient from all consideration until all their data is updated again
Instead, if we identified the clusters of data (is it just Encounter clusters?), we could track those.
- That is, instead of (or in addition to?) the proposed etl_patient_groups table from the description above, where we link patients to groups... maybe we link encounters to groups with an etl_encounter_groups table.
- As Condition groups come in, they get marked as complete and then you can programmatically know that your study (which cares about Conditions in this example) can now use that group's Encounters.

Problem scenarios, tricky to get right even with the above Encounter-oriented thinking:

I'm updating already-ingested group A with a fresh batch of data from this month. New Conditions and Encounters. I load Encounters in. The ETL marks that Encounter X is part of group A. How do we denote that the new Conditions in group A aren't actually loaded yet? Maybe we need some date-based timestamping when we say that X is a part of group A since date Z. But how do we get that correct, depending on the order of ETL runs (Condition then Encounter, or flipped) or incomplete/inaccurate date info in the record's fields?
I export Conditions from the EHR first and then Encounters a day later. I will end up loading a day's worth of Encounters that don't have connected Conditions yet.
I export Encounters first and then Conditions a day later. This one doesn't matter so much.

Some of the above is probably helped a lot by doing resource exports at the same time. And then we could probably try to use transactionTime from the bulk export response as a timestamp. That way, our data is guaranteed to have a comprehensive view at least.

So:

Ask folks to keep the log for the export around to pull a timestamp from (and/or allow the user to enter a timestamp themselves?)
Now every ETL job would need two extra bits of info: the group & the export timestamp for the resources being loaded.
When we mark completion info, we add the timestamp - a study will want to see a newer-or-equal Condition/group update timestamp compared to the time the Encounter first appeared in the group. If the Condition/group timestamp is older, that encounter is not viable.
Update our documentation to encourage exporting resources at the same time, if possible. Or at least, export Encounters first.
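The timestamp rule in the list above boils down to a single comparison. Here is a rough sketch of it; the function name and the string-based timestamps are assumptions, and the sample values use explicit `+00:00` offsets so that `datetime.fromisoformat` accepts them on older Python versions too.

```python
from datetime import datetime
from typing import Optional


def encounter_is_viable(encounter_first_seen: str, resource_export_time: Optional[str]) -> bool:
    """An encounter is usable once the linked resource was exported at or after it appeared.

    encounter_first_seen: export timestamp at which the Encounter first showed up in the group.
    resource_export_time: newest export timestamp recorded for e.g. Condition in that group,
        or None if that resource was never marked complete for the group.
    """
    if resource_export_time is None:
        return False
    return datetime.fromisoformat(resource_export_time) >= datetime.fromisoformat(encounter_first_seen)


# Conditions exported first, Encounters a day later: the new Encounters are not viable yet.
print(encounter_is_viable("2024-06-02T00:00:00+00:00", "2024-06-01T00:00:00+00:00"))

# Encounters exported first, Conditions a day later: fine.
print(encounter_is_viable("2024-06-01T00:00:00+00:00", "2024-06-02T00:00:00+00:00"))
```

This is also why exporting Encounters first is the safer ordering: the later Condition export timestamp covers every Encounter already registered.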

This would also let us catch probable-mistakes like loading old data on top of newer data by looking at the export timestamp you are providing. (important in the Cerner context, which doesn't have meta.lastUpdated)
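A sketch of what that probable-mistake check might look like, assuming we keep the newest export timestamp seen per (group, resource) pair. The `history` dict and `check_export_order` name are hypothetical; in practice the history would come from the completion table rather than memory.

```python
from datetime import datetime
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for the completion table:
# (group, resource) -> newest export timestamp already loaded.
History = Dict[Tuple[str, str], datetime]


def check_export_order(history: History, group: str, resource: str, export_time: str) -> bool:
    """Return True and record the run if it is not older than what is already loaded.

    Return False (and leave history untouched) when the export looks like old data
    being loaded on top of newer data.
    """
    new_time = datetime.fromisoformat(export_time)
    previous = history.get((group, resource))
    if previous is not None and new_time < previous:
        return False
    history[(group, resource)] = new_time
    return True


history: History = {}
print(check_export_order(history, "group-a", "Condition", "2024-06-01T00:00:00+00:00"))
print(check_export_order(history, "group-a", "Condition", "2024-05-01T00:00:00+00:00"))
```

Whether a stale export should be a hard error or just a warning is left open here.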

Current status

This mostly works! :tada: ... But you have to opt-in.

You can manually enable this feature on the ETL side and the Library will automatically respect the tracking:

Pass --write-completion to the ETL to turn this feature on.
If your input ndjson folder does not also include a log.ndjson from a Bulk Export (from which the ETL can grab a group name and export timestamp), you will need to also pass in --export-group and --export-timestamp.
You have to be a little careful about export ordering - you'll want to export your encounters first, before other data.
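For a feel of what "grabbing a group name and export timestamp from log.ndjson" involves, here is a rough sketch of reading the export URL and transaction time out of a log. The event and field names used (`eventId`, `eventDetail`, `kickoff`, `exportUrl`, `status_complete`, `transactionTime`) are my assumptions about the log layout, and the sample lines are made up; this is not the ETL's actual parsing code.

```python
import json


def read_export_info(log_lines):
    """Return (export_url, transaction_time) found in an export log, or None for each if absent."""
    export_url = None
    transaction_time = None
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        detail = event.get("eventDetail") or {}
        if event.get("eventId") == "kickoff":
            export_url = detail.get("exportUrl")
        elif event.get("eventId") == "status_complete":
            transaction_time = detail.get("transactionTime")
    return export_url, transaction_time


# Made-up sample log lines, shaped per the assumptions above.
sample_log = [
    json.dumps({
        "eventId": "kickoff",
        "eventDetail": {"exportUrl": "https://ehr.example.com/fhir/Group/cohort-a/$export"},
    }),
    json.dumps({
        "eventId": "status_complete",
        "eventDetail": {"transactionTime": "2024-06-01T00:00:00+00:00"},
    }),
]
print(read_export_info(sample_log))
```

If either value comes back as None, that is the case where --export-group and --export-timestamp would have to be passed by hand.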

What does completion tracking actually do again?

The Library core study will ignore Encounters that both:

Have completion info for themselves
- This offers backwards compatibility - any Encounters that aren't registered with completion tracking data will be included in core (i.e. all legacy Encounters will be included, because they won't have tracking data until you re-export their group)
AND do not have completion info for Conditions, DocumentReferences, MedicationRequests, and Observations loaded for the Encounter's group at timestamps later than or equal to the Encounter's timestamp.
- This indicates an incomplete / in-progress ETL ingestion.
- We look at those resources, because those are the resources that the Library examines - if it started looking at Procedure, we'd probably add Procedure to that list.
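Spelled out as code, the rule reads roughly like the sketch below. This is an illustration of the logic as described here, not the Library's actual implementation (which is SQL); the `should_ignore` name and the shapes of its inputs are assumptions.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

# The resources the Library currently examines, per the list above.
REQUIRED_RESOURCES = ("Condition", "DocumentReference", "MedicationRequest", "Observation")


def should_ignore(
    encounter_tracking: Optional[Tuple[str, datetime]],
    group_completion: Dict[Tuple[str, str], datetime],
) -> bool:
    """Decide whether core should skip an Encounter.

    encounter_tracking: (group, export time) recorded for the Encounter, or None for
        legacy Encounters that have no completion info at all.
    group_completion: (group, resource) -> newest export time marked complete.
    """
    if encounter_tracking is None:
        return False  # backwards compatibility: untracked Encounters are always included
    group, encounter_time = encounter_tracking
    for resource in REQUIRED_RESOURCES:
        completed_at = group_completion.get((group, resource))
        if completed_at is None or completed_at < encounter_time:
            return True  # incomplete / in-progress ingestion for this group
    return False


june_1 = datetime(2024, 6, 1, tzinfo=timezone.utc)
june_2 = datetime(2024, 6, 2, tzinfo=timezone.utc)
all_done = {("group-a", r): june_2 for r in REQUIRED_RESOURCES}
missing_obs = {k: v for k, v in all_done.items() if k[1] != "Observation"}

print(should_ignore(None, {}))                          # legacy Encounter
print(should_ignore(("group-a", june_1), all_done))     # everything caught up
print(should_ignore(("group-a", june_1), missing_obs))  # Observations not loaded yet
```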

See code.

Remaining work

Ideally completion tracking would be enabled by default. But before flipping that switch, this is the remaining work to be done:

Consider doing something smart for the "empty input set" case - you exported group A and got zero Procedures. Ideally we'd still mark Procedures as complete for that group. How do we detect that case (vs not having exported Procedures in the first place)? (See below for more discussion of how to solve this.)
Require the group name & timestamp from somewhere (from log or user) and drop the --write-completion flag
If the user provides a bulk export URL that chops down a group (like a URL that includes _typeFilter), we should probably require an explicit --export-group name instead of auto-detecting the group name from the URL.
Update user docs to mention this feature, and caveats around it (like exporting encounters first)
(Optional) Prevent overwriting newer group data with older data -- we do this for meta.lastUpdated, but this feature would give us the ability to look at the export timestamp and do the same kind of check. Which would help with Epic, which does not provide meta.lastUpdated. This doesn't have to happen before turning this feature on by default, it can happen whenever. Just mentioning it since it's a related feature and would be handy.
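For the _typeFilter item in the list above, a sketch of how group auto-detection could refuse to guess might look like this. The `detect_group` function, its error message, and the assumption that the group ID sits in a `Group/<id>/$export` path are all illustrative, not existing ETL behavior.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse


def detect_group(export_url: str, explicit_group: Optional[str] = None) -> Optional[str]:
    """Pick the group name for completion tracking.

    An explicit name (e.g. from --export-group) always wins. Otherwise, take the group ID
    from the URL path, but refuse when _typeFilter is present, since the export then
    only covers a slice of the group.
    """
    if explicit_group:
        return explicit_group
    parsed = urlparse(export_url)
    if "_typeFilter" in parse_qs(parsed.query):
        raise ValueError("Export URL uses _typeFilter; please pass an explicit --export-group")
    match = re.search(r"/Group/([^/]+)/\$export$", parsed.path)
    return match.group(1) if match else None


base = "https://ehr.example.com/fhir/Group/cohort-a/$export"
print(detect_group(base + "?_type=Condition"))
print(detect_group(base + "?_typeFilter=Condition%3Fseverity%3Dmild", explicit_group="cohort-a-mild"))
```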

Empty input set thoughts

Granted, this might not be very common. But it could happen, so we should try to handle it.
Right now, the ETL would not write any completion tracking info - it can't distinguish between "no export was attempted" and "export happened but we got no data".
The Bulk FHIR export spec curiously discourages servers from indicating the difference by saying that if there is no data for a resource, servers SHOULD NOT return an empty file / output element for the resource.
One solution: The ETL could try to distinguish these cases by looking at the export log and parsing _type from the export URL.
- But what if no _type was provided? (The user exported everything the server had...) Could warn the user in that case and ignore the problem, hoping that there were no zero counts...?
- What if the log file isn't present? Offer a CLI flag to say "no really, mark this group complete"? Or do same warn-and-pray strategy for that case.
Alternative fix: stop running all ETL core tasks by default, which could allow us to assume that if the user passed us --task=procedure, the folder has all the Procedure data for this export, zero or not.
- But that reduces the convenience of the CLI for everyone, just to cater to this edge case.
- And users might not appreciate that we're doing this behind the scenes - I could imagine them copying and pasting a big line with all the tasks named.
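To illustrate the first option (parsing _type from the export URL), here is a small sketch. The `resources_to_mark_complete` helper and its "certain" flag are assumptions; the point is only that a resource named in _type can be marked complete even when zero rows came back, while a missing _type leaves us guessing.

```python
from typing import Iterable, Set, Tuple
from urllib.parse import parse_qs, urlparse


def resources_to_mark_complete(export_url: str, resources_with_data: Iterable[str]) -> Tuple[Set[str], bool]:
    """Return (resources to mark complete, whether we are certain about empty ones).

    With _type in the URL, every requested resource counts as exported, even with zero rows.
    Without _type, we only know about resources that actually produced data.
    """
    query = parse_qs(urlparse(export_url).query)
    if "_type" not in query:
        return set(resources_with_data), False  # warn-and-pray territory
    requested = {name for value in query["_type"] for name in value.split(",") if name}
    return requested, True


url = "https://ehr.example.com/fhir/Group/cohort-a/$export?_type=Condition,Procedure"
# Only Condition files came back, but Procedure was requested, so both are complete.
print(resources_to_mark_complete(url, ["Condition"]))

# No _type given: we cannot tell an empty Procedure export from one that was never attempted.
print(resources_to_mark_complete("https://ehr.example.com/fhir/Group/cohort-a/$export", ["Condition"]))
```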

smart-on-fhir / cumulus-etl