opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases
Other
38 stars 13 forks source link

New variable for returning practice at the last known active registration date #532

Open wjchulme opened 3 years ago

wjchulme commented 3 years ago

@inglesp @acagreen17

Currently in EMIS we don't have patients' practice registration history. The practice ID returned by patients.registered_practice_as_of() corresponds to the practice where the patient was last actively registered in EMIS as of the latest EMIS database build and ignores any value passed to the date argument. If the patient is actively registered on the lastest build date, the variable will return this practice. If a patient is no longer registered at an EMIS practice (death or deregistration) it will return the last practice where they were registered.

It's not possible to (easily) return the equivalent in TPP. So, it would be useful to create a new variable that returns a practice ID that's consistent across backends.

It should also return the latest active registration date, i.e. the deregistration date or, if currently actively registered, the latest build date. This is different to patients.date_deregistered_from_all_supported_practices(date="my_date"), which only returns a deregistration date prior to "my_date" and is blank if the patient is actively registered on "my_date".

I don't think this needs to be any more complicated than

patients.latest_active_registration(
    returning="date"/"pseudo_id",
    return_expectations=...
)

There's no date argument or on_or_before/on_or_after/between arguments because the given date should be fixed at the time of the latest database build.

acagreen17 commented 3 years ago

Thinking long term here (and might be thinking beyond the current use of opensafely)...what about if someone is doing a study using data from before the latest build date (i.e. say they were conducting a study between 2000 and 2010), would not being able to specify a date cause issues here (i.e. would they get a date after their study period?)

evansd commented 3 years ago

I guess if you're doing a historical study then this isn't the function you'd want to use anyway. (And a historical study would be impossible in EMIS as it stands as we don't have historical registration data.)

sebbacon commented 3 years ago

@wjchulme, can you explain the use-case a bit more? i.e. why fixing patients.registered_practice_as_of() so it respects date wouldn't be the right fix?

I'm wondering why you'd want to know the date of a last active registration for a deregistered patient.

I'm not doubting it's useful, but thinking forward to helpful documentation - when should I use patients.registered_practice_as_of() and when should I use patients.latest_active_registration() - might this be a loaded gun for people who might not have thought carefully about what they want?

wjchulme commented 3 years ago

why you'd want to know the date of a last active registration for a deregistered patient.

It's important to know when people were deregistered so we know when we're going to stop seeing any recorded primary care events. This effectively ends a person's "follow-up" in a study, as they're no longer "at-risk" of experiencing a primary care coded event.

why fixing patients.registered_practice_as_of() so it respects date wouldn't be the right fix?

I'm not sure what that fix would look like. One issue is we don't have practice history in EMIS, we only know their most recent practice as of the latest EMIS build date. So currently the only way to get the query to respect the date is to return an error if it's run on any other date than the build date. Even then we can't easily access and use the build date, for instance with patients.registered_practice_as_of(date="latest_build_date"). Perhaps we can add a feature to tell the function to run it on the build date without having to update each time, and it chooses the right build date for the backend. Maybe "latest_build_date" could be a reserved variable name? or we could do something like patients.registered_practice_as_of(use_build_date="True")?

The other thing is if a patient is not registered on "latest_build_date", then TPP will return "" but EMIS will return the practice a patient was last registered with, so there's still an inconsistency that needs resolving.

when should I use patients.registered_practice_as_of() and when should I use patients.latest_active_registration()

On reflection, maybe it would be better to call the new variable something like (but better). But

I considered calling it something like system_consistent_practice() to make clear when it should be used, but latest_active_registration() is more descriptive, and is still a useful name even if/when we have registration history in EMIS

sebbacon commented 3 years ago

why fixing patients.registered_practice_as_of() so it respects date wouldn't be the right fix?

I'm not sure what that fix would look like.

It would look like us getting EMIS to add a registration history table to the database available to us.

wjchulme commented 3 years ago

It would look like us getting EMIS to add a registration history table to the database available to us.

:) That would be great, but my understanding from @inglesp is that this won't happen imminently. In the meantime, we need a way to derive practice ID consistently between backends.

The workaround without backend changes is this:

dereg_date = patients.date_deregistered_from_all_supported_practices(
   on_or_before = "latest_build_date" # "today" would probably be good enough here if that is still supported?
) 

last_active_registration_date = mininum_of(
   "dereg_date", 
   "latest_build_date"
) # does minimum_of work for dates?

practice_ID = patients.registered_practice_as_of(
   date = "last_active_registration_date - 1 day",
   returning="pseudo_id"
)

If run in EMIS, it will just use patients.registered_practice_as_of(), ingoring date. If run in TPP it will use date to match behaviour in EMIS. I think!

sebbacon commented 3 years ago

Ah, OK, now I get it!

There's no particular reason for current EMIS behaviour; it doesn't reflect any particular truth about the underlying data. So we should avoid, if possible, adding functions to our official API specifically to address the way the data happens to be exposed in a build that has been made specifically for us. Otherwise we run the risk I outlined above, I think.

What we do need to address is the fact that people are starting to use the EMIS backend at a time when we've only specifically addressed and/or thought about support for the vaccination coverage report. For example, if the EMIS backend currently ignores date this should at least be documented, somehow. I presume this is for @acagreen17's current factors-associated-with work?

If we can use a workaround we should do for now, rather than address it with specific, new functions.

This also relates to stuff @evansd has been thinking about; it would (as I understand it) make it easier for you to make a reusable, user-defined variable that encapsulates the logic above.

acagreen17 commented 3 years ago

I presume this is for @acagreen17's current factors-associated-with work?

Yes, although long term I think the plan will be to try and run analyses in both TPP and EMIS and combine results. Hence why we were thinking a new function might be better than a workaround.

@wjchulme, @inglesp are you happy for me to implement the workaround to derive practice ID consistently between backends?

wjchulme commented 3 years ago

@sebbacon OK, thanks, that makes sense.

@acagreen17 I don't think the proposed workaround will actually work! I don't think minimum_of() works for dates, and there's no coalesce or case_when equivalents in study definitions to pick out the correct non-null practice ID. The only way I can think to get the right practice ID is to extract two IDs and select the right one in a subsequent script. Am I missing a trick?

acagreen17 commented 3 years ago

As in use

practice_ID_latest_date = patients.registered_practice_as_of(
   date = "latest_build_date",
   returning="pseudo_id"
)

Which would give us a practice id for all EMIS patients but only a practice id for current registered TPP patients. So if a TPP patient is no longer registered at an TPP practice (death or deregistration) it will return a null/blank (?) So for TPP patients who have died or deregistered, to get their last known practice we would use

practice_ID_dereg = patients.registered_practice_as_of(
   date = "dereg_date", #(possibly use - 1 day? And assume this would be the same if we were extracting it on death_date?)
   returning="pseudo_id"
)

And then in a data processing script we would calculate practice_id_last_registered using the above.

acagreen17 commented 3 years ago

Just another thought - would we actually also need a practice_id_at_end = patients.registered_practice_as_of( end_date, returning = "pseudo_id") incase anyone has changed practice between the study end date and latest build date? Or is that overkill/I'm overthinking?

wjchulme commented 3 years ago

What you've said there looks sensible @acagreen17. Just to be clear, you could use something like

practice_ID = dplyr::coalesce(practice_ID_latest_date, practice_ID_dereg)

to combine practices in the script.

I don't think practice_id_at_end is useful because although people may have moved, and so the ID is out of date, we can't currently get that information from EMIS.

sebbacon commented 3 years ago

long term I think the plan will be to try and run analyses in both TPP and EMIS and combine results

Right - but right now, this is definitely long term, so I don't think worth investing too much time yet in getting it exactly right; this will be easier once we've refactored the study definition language and thought about from a higher-level, holistic viewpoint.