RCT dataset support (GermDefence)

sebbacon commented 3 years ago

New collaborators from Bristol are joining their GermDefence RCT dataset (which was exposing GPs / practices to a website-based intervention) into primary care data, to investigate health outcomes (the incidence of respiratory tract infections, COVID-19 diagnoses, COVID-19 symptoms, and gastrointestinal infections, and the number of primary care consultations and antibiotic prescriptions)

The raw data columns (27 of them) are pasted below.

They are only interested in data at a practice, not patient level.

So with the study definition model they'd want something like patients.with_germdefence_rct(), then define one variable for each outcome. How would they get the intervention data? Would we add something to registered_practice_as_of, along the lines of registered_practice_as_of(date, returning='germdefence__n_pages_viewed_mean')`?

The novel problem to consider here is that this is one-off; no-one else will have a use-case for germdefence data (in fact, no-one else should be able to access it at all, ideally). So baking it into the core study definition API seems very wrong.

We should also consider that longer term we might get more of these per-patient or per-practice variables that are only to be used for a single project.

Maybe there is a generic study definition call we could define, for random stuff like this, e.g. patients.with_custom_property(dataset='germdefence') or whatever.

Or perhaps we'd implement this as a backend (e.g. tpp-germdefence). And backends get their own subsets of study definition variables; and we implement an action to allow you to join data from different backends (assuming they're on the same server). The nice thing about this is we'd get to lock down access per-user.

columns:

code,
ccg_code,
listsize_Jan21,
deprivation_score,
intervention,
IntCon,
List_size,
IMD_rank,
IMD_decile,
MeanAge,
MedianAge,
Av_rooms_per_house,
Minority_ethnic_total,
n_times_visited_mean,
n_pages_viewed_mean,
total_visit_time_mean,
prop_engaged_visits,
n_engaged_visits_mean,
n_engaged_pages_viewed_mean_mean,
N_visits_practice,
group_mean_behaviour_mean,
group_mean_intention_mean,
N_completers_RI_behav,
N_completers_RI_intent,
hand_behav_practice_mean,
hand_intent_practice_mean,
N_completers_HW_behav,
N_goalsetting_completers_per_practice

evansd commented 3 years ago

I wonder if they need to access this data via the study definition at all, given that it's not patient level data? It feels like we need two things:

A standard cohort extraction with the usual clinical variables where the population is patients registered with the relevant practices over the relevant period.
A way of dumping the entire contents of that practice-level germdefence table into the workspace so they can join on it in their analytic code.

We might be able to solve 1 just by adding an option to do something like "where registered_pratice_pseudo_id in list". Or maybe we'd need to add something custom to do a join on that table. But in any case that's the only custom thing we'd need to add to the study definition.

For 2, I can see a general use for some mechanism for running named, canned queries which return something other than patient-level data. For instance, it might be useful to be able to dump some of the data dictionaries. I guess this would be still be done via the cohortextractor but it would be much simpler than a study definition. Really just "run query X and dump results to file Y".

~~If they had some way of extracting the cohort of patients registered with the relevant practices in the relevant period then they could dump the contents of this table via some separate mechanism.~~

evansd commented 3 years ago

On the permissions thing, my understanding is that we want to move to a world where we have metadata on each table indicating its provenance and this is combined with the user's specific IG permissions to determine whether they have access to it. I guess this would fit neatly into that model. If we go down the "canned queries" route then we'd want the equivalent metadata attached to each query so we can apply the same permissions system.

sebbacon commented 3 years ago

If they had some way of extracting the cohort of patients registered with the relevant practices in the relevant period then they could dump the contents of this table via some separate mechanism.

Not following this. Do you mean if they had a way of saying "dump table X to file Y" then it would also be good to say "use column A from file Y to select patient ids in this study definition"?

evansd commented 3 years ago

Argh, sorry! That was left over from what I initially started writing before I did the 1. 2. thing. Please ignore!

sebbacon commented 3 years ago

So yes, I agree

This could potentially be as simple as:

pseudo_ids = pd.read_csv("data/rct.csv")["patient_pseudo_id"]

study = StudyDefinition(
    population=patients.registered_in_list_of_practices_between(
        "2019-02-01", "2020-02-01", pseudo_ids=pseudo_ids
    ),
)

# more study definition good stuff

and then a later python action that joins rct_data to data/rct.csv.

But only if the RCT data can be released as data in the repo.

sebbacon commented 3 years ago

Aha: there is (perhaps) no IG issue regarding this solution, but there is a political one: we've promised ~to~ not to allow any GP practice to be identified.

So this solution does require the "raw" file only to be available within a backend. As matching on pseudo id can only be done by the EHR vendor, it makes sense for them to stick it in a table.

So then we get something like:

study = StudyDefinition(
    population=patients.registered_in_list_of_practices_between(
        "2019-02-01", "2020-02-01"
    ),
)

non_patient_queries = [
    TableExtract(table='germdefence_rct', format='feather', output_path='data/germdefence_rct.csv')
]

iaindillingham commented 3 years ago

I'd appreciate a quick conversation about this, because I'm not confident I understand what I've called candidate solutions below. The following is me summarizing the discussion.

Problem context

The dataset is specific to this study (27 variables; practice-level not patient-level)
The approach should generalise to other studies
Old cohort-extractor, new cohort-extractor, or both?

Candidate solutions

Add a function to patients that returns patients registered by a subset of practices e.g. registered_in_list_of_practices_between.

Is this a convenience? It ensures we're not extracting more patients than we need, but the join (see below) would reduce the set of patients to the subset of patients anyway.

Join the set of patients to the RCT data (a subset of practices).

We can't dump the dataset to the workspace ("canned query") because we've promised not to allow any practices to be identified. (I think your comment is missing a not, @sebbacon. Have I misunderstood?)
The dataset must be loaded into a table, so that it can be joined by pseudo patient ID.
The join must be made by cohort-extractor; it can't be made by e.g. a reusable action.
If joining by pseudo patient ID can only be done by the EHR vendor, how would dumping the dataset to the workspace have worked?

Could you explain how the non_patient_queries variable is used, @sebbacon?

I assume an instance of TableExtract dumps the dataset to the workspace. I thought we couldn't do this?

CarolineMorton commented 3 years ago

Just my two cents, i think we decide to go with the study definition model then patients.with_custom_property(dataset='germdefence') is far preferable to patients.with_germdefence_rct() as it does not limit us so much in the future and is a nice generic API. Would opt for something like patients.in_rct(dataset='germdefence')

sebbacon commented 3 years ago

Old cohort-extractor, new cohort-extractor, or both?

Old!

I think your comment is missing a not, @sebbacon

Correct, corrected!

If joining by pseudo patient ID can only be done by the EHR vendor, how would dumping the dataset to the workspace have worked?

In this case, we're joining by pseudo practice id. So the earlier idea was: the RCT data is just supplied by the researcher, committed in the github repo, and we just have to join on that. But then we realised we're not allowed to identify practices under our current IG arrangements, which would prevent the repo from being published.

Let's say you have this study definition:

study = StudyDefinition(
    population=patients.registered_with_one_practice_between(start_date, end_date),
    has_gi_infection = patients.with_these_clinical_codes(blah),
    has_positive_covid_test = patients.with_positive_covid_test(blah).
)

That would create an input.csv with one row per patient.

We are only interested in patients in a given set of practices, listed in the RCT CSV we're supplied with.

Yes, we could reduce that set of patients purely with join semantics. But we would also need to support aggregation at the same time. So it seems easier to do that join in pandas or whatever. And you need way less memory if we filter the patients at source.

That's why I suggested something along the lines of the cohortextractor spitting out two files, one a filtered list of patients, the other the "raw" data (albeit in feather format in this example).

I am sure there are other ways of doing this, that may be much better.

Hopefully that explains it a bit better?

iaindillingham commented 3 years ago

Getting there! 🙂 I think there are two key questions:

How do we indicate that we want the population to include only patients associated with practices taking part in the GermDefence RCT?
How do we indicate which of the GermDefence RCT variables we want to extract?

So far, it seems we have two approaches.

The mostly downstream (i.e. further from cohort-extractor) approach:

Add functionality to extract variables from non-patient datasets to cohort-extractor (e.g. TableExtract).
Extract two datasets: one contains the set of patients; one contains the set of practices; both are highly sensitive.
Join/filter and aggregate in an action, e.g. using Pandas.

The mostly upstream (i.e. nearer to cohort-extractor) approach:

Add functionality to join/filter patient and non-patient datasets to cohort-extractor.
This functionality still returns a one-row-per patient dataset. For example:

study = StudyDefinition(
    population=patients.satisfying(
        "is_registered AND is_part_of_germdefence",
        is_registered=patients.registered_with_one_practice_between("2019-02-01", "2020-02-01"),
        is_part_of_germdefence=patients.part_of("germdefence"),
    ),
    n_pages_viewed_mean=patients.with_custom_property("n_pages_viewed_mean"),
)

Aggregate in an action, e.g. using Pandas.

iaindillingham commented 3 years ago

Following today's meeting with @CarolineMorton, @evansd, and @sebbacon we're going to explore the mostly upstream approach. Whilst the pseudo-code above is more-or-less correct, we're not going to introduce a function like patients.with_custom_property and instead extend patients.registered_practice_as_of. For example:

study = StudyDefinition(
    index_date="2020-02-01",
    population=patients.satisfying(
        "is_registered AND is_part_of_germdefence",
        is_registered=patients.registered_as_of("index_date"),
        is_part_of_germdefence=patients.part_of("germdefence"),  # New function
    ),
    n_pages_viewed_mean=patients.registered_practice_as_of(
        "index_date",
        returning="germdefence.n_pages_viewed_mean",  # New value
    ),
)

We think this approach is compatible with cohort-extractor v2 -- a researcher could rewrite the study definition using v2 and expect it to return a dataset of the same shape.

What's new?

patients.part_of a function that returns a boolean indicating whether each patient was part of the given dataset. (Is dataset the correct noun?)
"germdefence.n_pages_viewed_mean" a string indicating the value to be returned from the given dataset. Should this be a dotted string? ("dataset.some_variable") A dunder string? ("dataset__some_variable") Something else?

It would be great to get some early feedback from @HelenCEBM, who will be co-piloting, on the proposal. Thanks Helen!

sebbacon commented 3 years ago

For the satisfying part, I was thinking more like this:


study = StudyDefinition(
    index_date="2020-02-01",
    population=patients.satisfying(
        "is_registered AND is_part_of_germdefence",
        is_registered=patients.registered_as_of("index_date"),
        is_part_of_germdefence=patients.registered_practice_as_of(
        "index_date", returning="germdefence.exists",  
    ),
    n_pages_viewed_mean=patients.registered_practice_as_of(
        "index_date",
        returning="germdefence.n_pages_viewed_mean",  # New value
    ),
)

For the returning syntax, the actual schema provided by TPP might give us some ideas.

One thought for now: namespacing for readability? rcts.germdefence.n_pages_viewed_mean or even rcts.practices.germdefence.n_pages_viewed_mean?

iaindillingham commented 3 years ago

Quick note: TPP are planning to add three new tables, as well as to update the schema table, this week. Eventually, they will appear in the OpenSAFELY-TPP database schema report.

sebbacon commented 3 years ago

From Chris:

There’s three generic tables for the GermDefence work so you can get working on it - as below:

OpenCorona..ClusterRandomisedTrial
(
    Organisation_ID INT,
    TrialNumber INT,
    TrialArm VARCHAR(100)
)

OpenCorona..ClusterRandomisedTrialDetail
(
    Organisation_ID INT,
    TrialNumber INT,
    Property VARCHAR(1000),
    PropertyValue VARCHAR(1000)
)

OpenCorona..ClusterRandomisedTrialReference
(
    TrialNumber INT,
    TrialName VARCHAR(1000),
    TrialDescription VARCHAR(1000),
    CPMSNumber INT
)

I made a change to remove a duplicate property from their data. I’ll re-run to populate later this morning / early afternoon.

iaindillingham commented 3 years ago

I had a quick call with Chris; here's the summary.

ClusterRandomisedTrialReference is the entity table; one row per RCT
ClusterRandomisedTrialReference.TrialNumber is an arbitrary PK
ClusterRandomisedTrialReference.CPMSNumber is how the outside world reference the RCT; we could think of this as a non-arbitrary PK, but probably shouldn't
ClusterRandomisedTrial is the through table; it relates RCTs to organisations (ClusterRandomisedTrialReference to Organisation)
ClusterRandomisedTrial.TrialArm describes within which arm (group) the organisation participated (e.g. either intervention or control, for a two-arm trial). Because different RCTs will have different conventions for naming different arms, this column is a string.
ClusterRandomisedTrialDetail is the mapping table that relates properties (variables) to their values

evansd commented 3 years ago

For reference, here is the current list of available properties:

Av_rooms_per_house
deprivation_pctile
group_mean_behaviour_mean
group_mean_intention_mean
hand_behav_practice_mean
hand_intent_practice_mean
IMD_decile
IntCon
MeanAge
MedianAge
Minority_ethnic_total
N_completers_HW_behav
N_completers_RI_behav
N_completers_RI_intent
n_engaged_pages_viewed_mean_mean
n_engaged_visits_mean
N_goalsetting_completers_per_practice
n_pages_viewed_mean
n_times_visited_mean
N_visits_practice
prop_engaged_visits
total_visit_time_mean

Source: Screenshot from 2021-09-17 11-00-11

evansd commented 3 years ago

This is now available in the latest version of cohortextractor, accessed via the patients.registered_practice_as_of() function, see in particular the Cluster RCT section.

opensafely-core / cohort-extractor

RCT dataset support (GermDefence) #599