opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases
Other
39 stars 13 forks source link

Support for multiple rows per patient #136

Open wjchulme opened 4 years ago

wjchulme commented 4 years ago

Currently, study_definition.py only accommodates one-row-per-patient datasets. It would be good to be able to construct one-row-per-patient-per-event study datasets to facilitate the use of epi methods that use repeated observations on the same individuals. This almost certainly means time-indexed repeat observations (eg for time-varying treatments, recurrent events, etc).

In theory you could still output all the relevant data in one-row-per-patient form, for example like this:

obs1date, obs1value1, obs1value2, obs1values3, obs2date, obs2value1, obs2value2, obs2value3, obs3date, ...

but this i think would be inefficient since you'd have to do some faffy munging to get the data in that format from the level 2 data, only to then have to undo it all for the analysis (since software for longitudinal methods typically needs a many-rows-per-patient dataset)

I'm assuming this won't "just work" within the current StudyDefinition function, because a) you will define a variable that returns multiple values per patient and b) it also needs to return the time index and c) the time index will need to be merged across different timy-varying variables. But if it somehow does work then the rest of this post can be ignored!

One solution would be to create two datasets in study_definition.py: a one-row-per-patient dataset for all variables that are assumed not to be time-varying (sex, ethnicity, ...) as happens currently with StudyDefinition; and a many-rows-per-patient dataset that contains all the time-indexed info using a new function, say StudyDefinitionMulti (but not that). The datasets can be joined on patient ids afterwards. This would mean there's no need to tinker with StudyDefinition. It's also useful from a research design perspective as it forces the study author to think about how to structure the data for whatever statistical models are being run. For example, do I need pt_id="1", date_of_death="2020-01-01" in the one-row-per-patient dataset or pt_id="1", date="2020-01-01", event_type="death" in the many-rows-per-patient dataset.

An alternative would be to allow the time-indexed variables to be returned all within one cell for each patient, so that each cell contains an indexed vector of repeated observations and not just a single value. These can be then unnested for the many-rows-per-patient dataset. But I've no idea how well Python handles vectors/arrays nested within dataframe cells.

Things to consider:

sebbacon commented 4 years ago

Just to make this real for me can you work through an example? I don't quite understand what sort of things obs1date, obs1value1, obs1value2, obs1values3, obs2date, obs2value1, obs2value2, obs2value3, would be.

wjchulme commented 4 years ago

We can either structure the data like this:

ptid date value
1 2020-01-01 1
1 2020-01-05 2
2 2020-01-04 5
2 2020-01-07 3
2 2020-01-15 4

Or like this:

ptid date1 value 1 date2 value2 date3 value3
1 2020-01-01 1 2020-01-05 2
2 2020-01-04 5 2020-01-07 3 2020-01-15 4

The former is much easier to work with programmatically, and is the format typically expected for stats routines in R, stata, etc

Neither option is possible with study definitions at the mo. At least, not for returning date-value pairs for any given event that can occur more than once (though see Measures for a solution to a related problem).

sebbacon commented 4 years ago

Sorry, I meant a real-life example of what value might be. Just so I can cement this in my head!

wjchulme commented 4 years ago

Could be anything. Common physiological investigations like cholesterol, blood pressure, creatinine, liver function tests, weight. Or reason for hospital admission. Or the result of a screening test for aortic anyurism.

sebbacon commented 4 years ago

So as a straw man what would it look like in a study definition? Blood pressure on the first of every month between date A and date B? Or the mean blood pressure during each quarter between date A and date B? Both? Other things?

wjchulme commented 4 years ago

Typically we'd want the actual measurement and the date it was taken, rather average measurements or values at regular time points (which is why in the example above I used irregular event dates, and a different number of events for different patients). So it's actually basically a subset of the event-level data.

I'm not sure how the study definition would look, as I think it would depend on whether we want to separate time-varying data and time-invariant data into two or more separate datasets or not (see OP for details). It also depends if you want a separate dataset for each time-based variable or not, and this really depends on the use-case. Easier to go through some options:

One massive one-row-per-patient dataset:

StudyDefinition(
  population = patients.all(), 
  sex = patients.sex(),
  sbp = events.with_these_codes(
    sbp_codelist,
    returning = "value",
    date_suffix="_date",
    value_suffix = "_value"
  ),
  dbp = events.with_these_codes(
    dbp_codelist,
    returning = "value",
    date_suffix="_date",
    value_suffix = "_value"
  ),
  admission = events.hosp_admission(
    returning="reason_for_admission",
    date_suffix="_date",
    value_suffix = "_reason"
  ),
)
ptid sex sbp_date1 sbp_value1 sbp_date2 sbp_value2 sbp_date3 sbp_value3 dbp_date1 dbp_value1 dbp_date2 dbp_value2 dbp_date3 dbp_value3 admission_date1 admission_reason1 admission_date2 admission_reason2
1 F 2020-01-01 110 2020-01-05 120 2020-01-01 70 2020-01-05 80 2019-06-24 "fall" 2019-12-25 "stroke"
2 M 2020-01-04 160 2020-01-07 170 2020-01-15 180 2020-01-04 80 2020-01-07 80 2020-01-15 85 2018-10-20 "MI"

Here the events library contains a set of extractor functions that return a multiple columns, one new column for each additional event.

The problem with this is that it doesn't store data "tidily" nor efficiently, and will inevitably require a mammoth re-structuring for use in R or STATA.

Two datasets:

A one-row-per-patient dataset:

study = StudyDefinition(
  population = patients.all(), 
  sex = patients.sex(),
)
ptid sex
1 F
2 M

And a one-row-per-event dataset:

study_events = StudyDefinitionEvents(
  population = events.from_these_patients(study), 
  sbp = events.with_these_codes(
    sbp_codelist,
    returning = "value",
  ),
  dbp = events.with_these_codes(
    dbp_codelist,
    returning = "value",
  ),
  hosp_admission = events.hosp_admission(
    returning="reason_for_admission",
  )
)
ptid date event value
1 2020-01-01 "sbp" 110
1 2020-01-05 "sbp" 120
1 2020-01-01 "dbp" 70
1 2020-01-05 "dbp" 80
1 2019-06-24 "hosp_admission" "fall"
1 2019-12-25 "hosp_admission" "stroke"
2 2020-01-04 "sbp" 160
2 2020-01-07 "sbp" 170
2 2020-01-15 "sbp" 180
2 2020-01-04 "dbp" 80
2 2020-01-07 "dbp" 80
2 2020-01-15 "dbp" 85
2 2018-10-20 "hosp_admission" "MI"

Here the population = events.from_these_patients(study) says we only want events from patients who are included in the study dataset defined earlier.

The problem with this approach is that you have mixed types in the value column, so it doesn't store data very efficiently, and may be difficult to recover true types when split up for analysis.

Many datasets:

A one-row-per-patient dataset:

study = StudyDefinition(
  population = patients.all(), 
  sex = patients.sex(),
)
ptid sex
1 F
2 M

A dataset for blood pressure:

measurement_events = StudyDefinitionEvents(
  population = events.from_these_patients(study), 
  sbp = events.with_these_codes(
    sbp_codelist,
    returning = "value",
  ),
  dbp = events.with_these_codes(
    dbp_codelist,
    returning = "value",
  ),
)
ptid date sbp dbp
1 2020-01-01 110 70
1 2020-01-05 120 80
2 2020-01-04 160 80
2 2020-01-07 170 80
2 2020-01-15 180 85

which would essentially full_join all SBP measurements and all DBP measurements by ptid and date (and assumes no more than one measurement per day)

And a dataset for hosp admissions:

admission_events = StudyDefinitionEvents(
  population = events.from_these_patients(study), 
  reason = events.hosp_admission(
    returning="reason_for_admission",
  )
)
ptid date admission_reason
1 2019-06-24 "fall"
1 2019-12-25 "stroke"
2 2018-10-20 "MI"

which could also return eg length of stay, discharge location, etc, all linked to the same admission event date. This is my preferred option as it's flexible enough to handle lots of different situations.

chat

These are just some incomplete ideas. Would it be helpful if I collected a few examples of research and/or methods that rely on data structured in this way?

sebbacon commented 4 years ago

These examples are super-helpful. I think your final suggestion makes the most sense. I think you're best placed to know if one or two more real-world examples would add more value. If in doubt, I'd err towards seeking a couple more. In particular it's very helpful from a software design point of view for me to understand exactly how these things map to to the real world (where by "real world" I mean "the world of the analyst / statistician")

wjchulme commented 4 years ago

An additional issue that may need resolving is having more flexibility for "as of" dates. Currently we can ask for example "what's the patient's care home status as of ", where index date is a scalar fixed across all patients. But we can't ask "what's the patient's care home status as of ", i.e., for vectorised dates.

You'd want this for both one-row-per-patient datasets and one-row-per-event datasets. It would probably need a table of "state transition" dates (care home status, dialysis status, employment status, whatever) to implement

DarthCTR commented 4 years ago

@wjchulme you are sent from above. So much clearer than I could have ever described.

Programmatically, the "Many datasets" option makes most sense to me.

How I'd imagine it is being able to create a single "one-row per patient" dataset that contains all fixed variables (ethnicity, gender, index date, age at index date, comorbidities at index date, etc.), essentially what we currently have the ability to do.

But for some studies (honestly, most studies using EHR data), I would have a number of additional tables that I would link to to create longitudinal exposures/covariates. For example, HBA1c (a laboratory measure). Let's say I want to adjust for HBA1c at baseline and through follow-up, allowing for changes over time.

Baseline is easy (if we just wanted most recent measure prior to a fixed index date, say 1 March 2020). Study definitions can currently do this by:

hba1c_pct_baseline=patients.with_these_clinical_events(
        hba1c_old_codes,
        find_last_match_in_period=True,
        on_or_before="2020-02-29",
        returning="numeric_value",
        include_date_of_match=True,
    ),

But for follow-up, which for this example say it's 1 March 2020 to 31 October 2020, I'd want to create a separate "multiple rows per patient" dataset that stores every date and results of HbA1c recorded during the study period. Something like...

hba1c_pct_fup=patients.with_these_clinical_events(
        hba1c_old_codes,
        **find_all_matches_in_period**=True,
        **between=["2020-03-01", "2020-10-31"],**
        returning="numeric_value",
        include_date_of_match=True,
    ),

And this would look like Will's blood pressure dataset above.

The follow-up question would be.. can we use dynamic dates as Will suggests in the message prior. Very important for analyses that say, start at positive test date, medication receipt date, hospitalisation date, etc.

If we get the capability to extract multiple datasets of longitudinal information, a workaround for dynamic dates would be to pull all values in the full study period, say 1 March 2020 to 31 October 2020. And then identify the closest value of something prior to the dynamic index date (e.g., hospitalisation) that would have occurred sometime in that period?

VERY happy to discuss this on a call. So far we have been doing excellent things with static variables. But the full power of EHR data really comes from the longitudinality of the data. :) :)

wjchulme commented 2 years ago

from @sebbacon in this slack thread

what language features might make [event-level data queries via study definitions] easier to implement our policies [that relate to data disclosure / access minimisation]?

Some suggestions: