smart-on-fhir / cumulus-etl

Extract FHIR data, Transform with NLP and DEID tools, and then Load FHIR data into a SQL Database for analysis
https://docs.smarthealthit.org/cumulus/etl
Apache License 2.0

Mark which resources are loaded for which patients (i.e. "completion tracking") #296

Open mikix opened 7 months ago

mikix commented 7 months ago

This comes from a study need:

My initial thoughts on this are to have the ETL keep a metadata table around, marking which resources are "finished" at the Group level. And then which patients belong to which Groups. That way a study could ask if patient X has Conditions yet.
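A minimal sketch of that metadata idea, using an in-memory SQLite database purely for illustration. The table names, column names, and helper functions here are all hypothetical and are not the ETL's actual schema:

```python
import sqlite3

# Hypothetical schema: one table marks which resource types are "finished"
# for a Group, the other records which patients belong to which Group.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE group_completion (group_name TEXT, resource_type TEXT);
    CREATE TABLE group_membership (group_name TEXT, patient_id TEXT);
    """
)


def mark_finished(group_name: str, resource_type: str) -> None:
    """Record that a resource type is fully loaded for a Group."""
    conn.execute(
        "INSERT INTO group_completion VALUES (?, ?)", (group_name, resource_type)
    )


def add_patient(group_name: str, patient_id: str) -> None:
    """Record that a patient belongs to a Group."""
    conn.execute(
        "INSERT INTO group_membership VALUES (?, ?)", (group_name, patient_id)
    )


def patient_has(patient_id: str, resource_type: str) -> bool:
    """True if any Group containing the patient has finished this resource type."""
    row = conn.execute(
        """
        SELECT 1
        FROM group_membership AS m
        JOIN group_completion AS c ON c.group_name = m.group_name
        WHERE m.patient_id = ? AND c.resource_type = ?
        LIMIT 1
        """,
        (patient_id, resource_type),
    ).fetchone()
    return row is not None


# "Does patient X have Conditions yet?"
add_patient("group-a", "patient-x")
mark_finished("group-a", "Condition")
has_conditions = patient_has("patient-x", "Condition")
has_procedures = patient_has("patient-x", "Procedure")
```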

Brainstorming for that approach:

mikix commented 5 months ago

Further thinking about this.

tl;dr: Let's track Encounters as our primary completeness kernel-of-truth (rather than patients, as the description above proposes). Studies can ignore any and all Encounter-linked data if it's not loaded yet.

There are two use cases I can think of:

  1. A researcher running a study wants to know "If I run the study now, is it going to be the most complete view of the data we have available to us?" - i.e. "Is now the best time to run the study, or should I wait for an ongoing data ingestion process to finish?"
  2. An engineer doing data ingestion does not want to cause studies to create misleading data while the ingestion is in-flight. Flipped around: a researcher running a study wants to feel confident that an ongoing ingestion process will not provide misleading data in the meantime. (e.g. no Conditions for an Encounter because they haven't been loaded in yet)

Those are closely related, but a little different. The first is about incompleteness at a broad scale leading to a diminished ability to do meaningful analysis. The second is about incompleteness at a small scale leading to inaccurate analysis.

Solving for the first (incompleteness aka "Is there an ongoing ingestion process?"):

Solving for the second (inaccuracy aka "How to stop the data from lying to me during an ingestion process?"):

mikix commented 5 months ago

Problem scenarios, tricky to get right even with the above Encounter-oriented thinking:

Much of the above is probably helped by exporting all the resources at the same time. We could then try to use transactionTime from the bulk export response as a timestamp. That way, our data is at least guaranteed to have a comprehensive view.

So:

This would also let us catch probable mistakes, like loading old data on top of newer data, by looking at the export timestamp you are providing. (This is important in the Epic context, which doesn't have meta.lastUpdated.)
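As a rough sketch of that check: the bulk export status response carries a transactionTime field, which could be compared against the timestamp of the last load for the same group. The function names and the idea of storing the previous timestamp as a string are assumptions for illustration, not how the ETL does it:

```python
from datetime import datetime
from typing import Optional


def parse_instant(value: str) -> datetime:
    """Parse a FHIR instant, normalizing a trailing "Z" for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale_export(manifest: dict, last_loaded: Optional[str]) -> bool:
    """
    True if this export is older than what was already loaded for the group.

    `manifest` is the parsed JSON of a bulk export status response.
    `last_loaded` is the (hypothetical) stored timestamp of the previous load,
    or None if nothing has been loaded yet.
    """
    if last_loaded is None:
        return False
    export_time = parse_instant(manifest["transactionTime"])
    return export_time < parse_instant(last_loaded)


old_manifest = {"transactionTime": "2024-01-01T00:00:00Z"}
stale = is_stale_export(old_manifest, "2024-02-01T00:00:00Z")
```

Both timestamps need a timezone offset for the comparison to work, which FHIR instants always include.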

mikix commented 2 months ago

Current status

This mostly works! :tada: ... But you have to opt-in.

You can manually enable this feature on the ETL side and the Library will automatically respect the tracking:

What does completion tracking actually do again?

The Library core study will ignore Encounters that both:

See code.

Remaining work

Ideally completion tracking would be enabled by default. But before flipping that switch, this is the remaining work to be done:

  1. Consider doing something smart for the "empty input set" case - you exported group A and got zero Procedures. Ideally we'd still mark Procedures as complete for that group. How do we detect that case (vs not having exported Procedures in the first place)? (See below for more discussion of how to solve this.)
  2. Require the group name & timestamp from somewhere (from log or user) and drop the --write-completion flag
  3. If the user provides a bulk export URL that chops down a group (like a URL that includes _typeFilter), we should probably require an explicit --export-group name instead of auto-detecting the group name from the URL.
  4. Update user docs to mention this feature, and caveats around it (like exporting encounters first)
  5. (Optional) Prevent overwriting newer group data with older data -- we do this for meta.lastUpdated, but this feature would give us the ability to look at the export timestamp and do the same kind of check. That would help with Epic, which does not provide meta.lastUpdated. This doesn't have to happen before turning this feature on by default; it can happen whenever. Just mentioning it since it's a related feature and would be handy.
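For items 1 and 3, one possible starting point is to inspect the bulk export URL itself. The sketch below pulls out the group name, notes whether _typeFilter is present, and lists any explicitly requested _type values. The function name and return shape are hypothetical, and this is not necessarily how the ETL detects groups:

```python
import re
from urllib.parse import parse_qs, unquote, urlparse


def inspect_export_url(url: str) -> dict:
    """Extract the group name, _typeFilter presence, and _type list from a URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)

    # Look for a ".../Group/<id>/$export" path (the "$" may be percent-encoded).
    match = re.search(r"/Group/([^/]+)/\$export$", unquote(parsed.path))

    # _type may be repeated and/or comma-separated.
    types = []
    for value in query.get("_type", []):
        types.extend(t for t in value.split(",") if t)

    return {
        "group": match.group(1) if match else None,
        "filtered": "_typeFilter" in query,
        "types": types,
    }


info = inspect_export_url(
    "https://example.com/fhir/Group/group-a/$export?_type=Condition,Procedure"
)
```

If `filtered` is true, the export only covers a slice of the group, which is the case where an explicit --export-group name seems safer. The `types` list might help tell "exported Procedures and got zero" apart from "never asked for Procedures", but only when _type was given explicitly; without it, the list is empty and the server decides which types to export.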

Empty input set thoughts