Open mikix opened 7 months ago
Further thinking about this.
tl;dr; Let's track Encounters as our primary completeness kernel-of-truth (rather than patients as the description above talks about). Studies can ignore any and all Encounter-linked data if it's not loaded yet.
There are two use cases I can think of:
Those are closely related, but a little different. The first is about incompleteness at a broad scale leading to a diminished ability to do meaningful analysis. The second is about incompleteness at a small scale leading to inaccurate analysis.
Solving for the first (incompleteness aka "Is there an ongoing ingestion process?"):
- … `etl_complete` table above in the description - query for the resources you want and a date range and see if you've got all the Conditions loaded up to June of this year, for example.
Solving for the second (inaccuracy aka "How to stop the data from lying to me during an ingestion process?"):
- (… `Device`), those are just loose piles of data and any updates (read: new bucket of data being poured on top) can just come in when they come in. If you don't have all the latest updates in the pile yet, that's an incompleteness issue ("do we have it all yet?"), but not an inaccurate outcome.
- (… `Condition.encounter`) we want to avoid considering either side of the link until both are available. Or we risk an inaccurate view of the data.
- … `etl_patient_groups` table from the description above, where we link patients to groups... maybe we link encounters to groups with an `etl_encounter_groups` table.
Problem scenarios, tricky to get right even with the above Encounter-oriented thinking:
Some of the above is probably helped a lot by doing resource exports at the same time. And then we could probably try to use `transactionTime` from the bulk export response as a timestamp. That way, our data is guaranteed to have a comprehensive view at least.
So:
This would also let us catch probable-mistakes like loading old data on top of newer data by looking at the export timestamp you are providing. (important in the Cerner context, which doesn't have `meta.lastUpdated`)
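To make the "old data on top of newer data" check concrete, here is a minimal sketch of comparing export timestamps. It is not the ETL's actual code: the function names are made up for illustration, and it assumes the timestamps are FHIR instants like the `transactionTime` value from a bulk export response.

```python
from datetime import datetime
from typing import Optional


def parse_fhir_instant(value: str) -> datetime:
    """Parse a FHIR instant such as "2024-06-01T12:00:00Z" (hypothetical helper)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_safe_to_load(previous_export: Optional[str], incoming_export: str) -> bool:
    """Return False when the incoming export is older than what is already loaded."""
    if previous_export is None:
        # Nothing tracked yet for this group/resource, so there is nothing to clobber.
        return True
    return parse_fhir_instant(incoming_export) >= parse_fhir_instant(previous_export)
```

A caller could warn or bail out when `is_safe_to_load()` returns False, instead of silently pouring an older bucket of data on top.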
This mostly works! :tada: ... But you have to opt-in.
You can manually enable this feature on the ETL side and the Library will automatically respect the tracking:
- … `--write-completion` to the ETL to turn this feature on.
- … `log.ndjson` from a Bulk Export (from which the ETL can grab a group name and export timestamp), you will need to also pass in `--export-group` and `--export-timestamp`.
The Library `core` study will ignore Encounters that both:
- … `core` (i.e. all legacy Encounters will be included, because they won't have tracking data until you re-export their group)
See code.
Ideally completion tracking would be enabled by default. But before flipping that switch, this is the remaining work to be done:
- … `--write-completion` flag
- (… `_typeFilter`), we should probably require an explicit `--export-group` name instead of auto-detecting the group name from the URL.
- … `meta.lastUpdated`, but this feature would give us the ability to look at the export timestamp and do the same kind of check. Which would help with Epic, which does not provide `meta.lastUpdated`. This doesn't have to happen before turning this feature on by default, it can happen whenever. Just mentioning it since it's a related feature and would be handy.
- … `_type` from the export URL.
- … `_type` was provided? (The user exported everything the server had...) Could warn the user in that case and ignore the problem, hoping that there were no zero counts...?
- … `--task=procedure`, the folder has all the Procedure data for this export, zero or not.
This comes from a study need:
- (… `core__` tables) would probably want to ignore patients that don't have the resources they care about loaded.
My initial thoughts on this are to have the ETL keep a metadata table around, marking which resources are "finished" at the Group level. And then which patients belong to which Groups. That way a study could ask if patient X has Conditions yet.
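As a rough sketch of that idea (not a settled design), the two metadata tables and the "does patient X have Conditions yet?" question might look like this. The table names `etl_complete` and `etl_patient_groups` come from this issue, but the column names, the sample rows, and the use of an in-memory SQLite database are all assumptions made purely for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create hypothetical versions of the two metadata tables with sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        -- Which resources are "finished" for each Group (columns are assumed).
        CREATE TABLE etl_complete (group_name TEXT, resource TEXT, export_time TEXT);
        -- Which patients belong to which Groups (columns are assumed).
        CREATE TABLE etl_patient_groups (patient_id TEXT, group_name TEXT);
        """
    )
    db.executemany(
        "INSERT INTO etl_complete VALUES (?, ?, ?)",
        [
            ("group-a", "Patient", "2024-06-01T00:00:00Z"),
            ("group-a", "Condition", "2024-06-01T00:00:00Z"),
            ("group-b", "Patient", "2024-06-01T00:00:00Z"),  # no Conditions loaded yet
        ],
    )
    db.executemany(
        "INSERT INTO etl_patient_groups VALUES (?, ?)",
        [("patient-1", "group-a"), ("patient-2", "group-b")],
    )
    return db


def patient_has_resource(db: sqlite3.Connection, patient_id: str, resource: str) -> bool:
    """True if any Group the patient belongs to has finished loading the resource."""
    row = db.execute(
        """
        SELECT 1
        FROM etl_patient_groups AS pg
        JOIN etl_complete AS c ON c.group_name = pg.group_name
        WHERE pg.patient_id = ? AND c.resource = ?
        LIMIT 1
        """,
        (patient_id, resource),
    ).fetchone()
    return row is not None
```

With the sample rows, `patient_has_resource(db, "patient-1", "Condition")` is true and the same call for `"patient-2"` is false, so a study could filter out `patient-2` until its Group's Conditions land.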
Brainstorming for that approach: