wjchulme opened 4 years ago
Just to make this real for me, can you work through an example? I don't quite understand what sort of things `obs1date, obs1value1, obs1value2, obs1value3, obs2date, obs2value1, obs2value2, obs2value3` would be.
We can either structure the data like this:
ptid | date | value |
---|---|---|
1 | 2020-01-01 | 1 |
1 | 2020-01-05 | 2 |
2 | 2020-01-04 | 5 |
2 | 2020-01-07 | 3 |
2 | 2020-01-15 | 4 |
Or like this:
ptid | date1 | value1 | date2 | value2 | date3 | value3 |
---|---|---|---|---|---|---|
1 | 2020-01-01 | 1 | 2020-01-05 | 2 | | |
2 | 2020-01-04 | 5 | 2020-01-07 | 3 | 2020-01-15 | 4 |
The former is much easier to work with programmatically, and is the format typically expected by stats routines in R, Stata, etc.
Neither option is possible with study definitions at the mo. At least, not for returning date-value pairs for any given event that can occur more than once (though see Measures for a solution to a related problem).
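To make the difference concrete, here is a hedged pandas sketch (not part of any study definition; the column names are taken from the tables above) of recovering the long format from the wide one:

```python
import pandas as pd

# Wide layout: one row per patient, repeated date/value column pairs.
wide = pd.DataFrame({
    "ptid": [1, 2],
    "date1": ["2020-01-01", "2020-01-04"],
    "value1": [1, 5],
    "date2": ["2020-01-05", "2020-01-07"],
    "value2": [2, 3],
    "date3": [None, "2020-01-15"],
    "value3": [None, 4],
})

# Reshape to one row per patient per event, dropping the padding cells
# that exist only because patients have different event counts.
long = pd.wide_to_long(wide, stubnames=["date", "value"], i="ptid", j="obs")
long = long.dropna(subset=["date"]).reset_index().sort_values(["ptid", "obs"])
print(long)
```

The long frame has 5 rows (2 events for patient 1, 3 for patient 2), which is exactly the first table above.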
Sorry, I meant a real-life example of what `value` might be. Just so I can cement this in my head!
Could be anything. Common physiological investigations like cholesterol, blood pressure, creatinine, liver function tests, weight. Or the reason for a hospital admission. Or the result of a screening test for aortic aneurysm.
So as a straw man what would it look like in a study definition? Blood pressure on the first of every month between date A and date B? Or the mean blood pressure during each quarter between date A and date B? Both? Other things?
Typically we'd want the actual measurement and the date it was taken, rather than average measurements or values at regular time points (which is why, in the example above, I used irregular event dates and a different number of events for different patients). So it's basically a subset of the event-level data.
I'm not sure how the study definition would look, as I think it would depend on whether we want to separate time-varying data and time-invariant data into two or more separate datasets or not (see OP for details). It also depends if you want a separate dataset for each time-based variable or not, and this really depends on the use-case. Easier to go through some options:
```python
StudyDefinition(
    population = patients.all(),
    sex = patients.sex(),
    sbp = events.with_these_codes(
        sbp_codelist,
        returning = "value",
        date_suffix = "_date",
        value_suffix = "_value",
    ),
    dbp = events.with_these_codes(
        dbp_codelist,
        returning = "value",
        date_suffix = "_date",
        value_suffix = "_value",
    ),
    admission = events.hosp_admission(
        returning = "reason_for_admission",
        date_suffix = "_date",
        value_suffix = "_reason",
    ),
)
```
ptid | sex | sbp_date1 | sbp_value1 | sbp_date2 | sbp_value2 | sbp_date3 | sbp_value3 | dbp_date1 | dbp_value1 | dbp_date2 | dbp_value2 | dbp_date3 | dbp_value3 | admission_date1 | admission_reason1 | admission_date2 | admission_reason2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | F | 2020-01-01 | 110 | 2020-01-05 | 120 | | | 2020-01-01 | 70 | 2020-01-05 | 80 | | | 2019-06-24 | "fall" | 2019-12-25 | "stroke" |
2 | M | 2020-01-04 | 160 | 2020-01-07 | 170 | 2020-01-15 | 180 | 2020-01-04 | 80 | 2020-01-07 | 80 | 2020-01-15 | 85 | 2018-10-20 | "MI" | | |
Here the `events` library contains a set of extractor functions that return multiple columns, one new column for each additional event.
The problem with this is that it doesn't store data "tidily" or efficiently, and will inevitably require a mammoth re-structuring for use in R or Stata.
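To give a flavour of that re-structuring, here is a hedged pandas sketch (column names from the table above; sbp columns only, and the same dance would be repeated for dbp and admissions) using melt plus a pattern split:

```python
import pandas as pd

# A slice of the hypothetical wide extract: sbp columns only.
wide = pd.DataFrame({
    "ptid": [1, 2],
    "sbp_date1": ["2020-01-01", "2020-01-04"],
    "sbp_value1": [110.0, 160.0],
    "sbp_date2": ["2020-01-05", "2020-01-07"],
    "sbp_value2": [120.0, 170.0],
    "sbp_date3": [None, "2020-01-15"],
    "sbp_value3": [None, 180.0],
})

# Melt everything, split "sbp_value2" into (variable="sbp_value", obs="2"),
# then pivot the variable part back out into proper columns.
melted = wide.melt(id_vars="ptid")
melted[["variable", "obs"]] = melted["variable"].str.extract(r"(\D+)(\d+)")
tidy = (
    melted.pivot(index=["ptid", "obs"], columns="variable", values="value")
    .dropna(subset=["sbp_date"])
    .reset_index()
)
print(tidy)
```

Doable, but it's boilerplate the analyst has to write and then effectively throw away, since the long shape is what the statistical software wanted in the first place.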
A one-row-per-patient dataset:
```python
study = StudyDefinition(
    population = patients.all(),
    sex = patients.sex(),
)
```
ptid | sex |
---|---|
1 | F |
2 | M |
And a one-row-per-event dataset:
```python
study_events = StudyDefinitionEvents(
    population = events.from_these_patients(study),
    sbp = events.with_these_codes(
        sbp_codelist,
        returning = "value",
    ),
    dbp = events.with_these_codes(
        dbp_codelist,
        returning = "value",
    ),
    hosp_admission = events.hosp_admission(
        returning = "reason_for_admission",
    ),
)
```
ptid | date | event | value |
---|---|---|---|
1 | 2020-01-01 | "sbp" | 110 |
1 | 2020-01-05 | "sbp" | 120 |
1 | 2020-01-01 | "dbp" | 70 |
1 | 2020-01-05 | "dbp" | 80 |
1 | 2019-06-24 | "hosp_admission" | "fall" |
1 | 2019-12-25 | "hosp_admission" | "stroke" |
2 | 2020-01-04 | "sbp" | 160 |
2 | 2020-01-07 | "sbp" | 170 |
2 | 2020-01-15 | "sbp" | 180 |
2 | 2020-01-04 | "dbp" | 80 |
2 | 2020-01-07 | "dbp" | 80 |
2 | 2020-01-15 | "dbp" | 85 |
2 | 2018-10-20 | "hosp_admission" | "MI" |
Here `population = events.from_these_patients(study)` says we only want events from patients who are included in the `study` dataset defined earlier.
The problem with this approach is that you have mixed types in the `value` column, so it doesn't store data very efficiently, and it may be difficult to recover the true types when the data are split up for analysis.
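The type-recovery problem can be sketched in pandas (event names and values taken from the table above): numeric readings survive the round-trip through a mixed object column only after an explicit cast.

```python
import pandas as pd

# Mixed-type long format: numeric readings and string admission reasons
# share a single object-dtype "value" column.
events = pd.DataFrame({
    "ptid": [1, 1, 1, 2, 2],
    "date": ["2020-01-01", "2020-01-05", "2019-06-24", "2020-01-04", "2018-10-20"],
    "event": ["sbp", "sbp", "hosp_admission", "sbp", "hosp_admission"],
    "value": [110, 120, "fall", 160, "MI"],  # dtype: object
})

# Split by event type and restore the true dtype of each slice.
sbp = events[events["event"] == "sbp"].copy()
sbp["value"] = pd.to_numeric(sbp["value"])  # back to a numeric column

admissions = events[events["event"] == "hosp_admission"].copy()
# "value" here is genuinely a string column; no cast needed.
print(sbp.dtypes["value"], admissions.dtypes["value"])
```

Each analyst has to remember (or be told) which event types are numeric, dates, or categorical, which is exactly the fragility described above.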
A one-row-per-patient dataset:
```python
study = StudyDefinition(
    population = patients.all(),
    sex = patients.sex(),
)
```
ptid | sex |
---|---|
1 | F |
2 | M |
A dataset for blood pressure:
```python
measurement_events = StudyDefinitionEvents(
    population = events.from_these_patients(study),
    sbp = events.with_these_codes(
        sbp_codelist,
        returning = "value",
    ),
    dbp = events.with_these_codes(
        dbp_codelist,
        returning = "value",
    ),
)
```
ptid | date | sbp | dbp |
---|---|---|---|
1 | 2020-01-01 | 110 | 70 |
1 | 2020-01-05 | 120 | 80 |
2 | 2020-01-04 | 160 | 80 |
2 | 2020-01-07 | 170 | 80 |
2 | 2020-01-15 | 180 | 85 |
which would essentially `full_join` all SBP measurements and all DBP measurements by `ptid` and `date` (and assumes no more than one measurement per day)
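That full join corresponds to an outer merge in pandas; a minimal sketch using the measurements above:

```python
import pandas as pd

sbp = pd.DataFrame({
    "ptid": [1, 1, 2, 2, 2],
    "date": ["2020-01-01", "2020-01-05", "2020-01-04", "2020-01-07", "2020-01-15"],
    "sbp": [110, 120, 160, 170, 180],
})
dbp = pd.DataFrame({
    "ptid": [1, 1, 2, 2, 2],
    "date": ["2020-01-01", "2020-01-05", "2020-01-04", "2020-01-07", "2020-01-15"],
    "dbp": [70, 80, 80, 80, 85],
})

# Outer merge keeps a row whenever either measurement exists on a date;
# this relies on at most one measurement per patient per day, otherwise
# the join key (ptid, date) is no longer unique and rows multiply.
bp = sbp.merge(dbp, on=["ptid", "date"], how="outer").sort_values(["ptid", "date"])
print(bp)
```

If a patient had only an SBP reading on some date, the `dbp` cell would simply be missing for that row, which is usually the desired behaviour.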
And a dataset for hosp admissions:
```python
admission_events = StudyDefinitionEvents(
    population = events.from_these_patients(study),
    reason = events.hosp_admission(
        returning = "reason_for_admission",
    ),
)
```
ptid | date | admission_reason |
---|---|---|
1 | 2019-06-24 | "fall" |
1 | 2019-12-25 | "stroke" |
2 | 2018-10-20 | "MI" |
which could also return e.g. length of stay, discharge location, etc., all linked to the same admission event date. This is my preferred option, as it's flexible enough to handle lots of different situations.
These are just some incomplete ideas. Would it be helpful if I collected a few examples of research and/or methods that rely on data structured in this way?
These examples are super-helpful. I think your final suggestion makes the most sense. I think you're best placed to know if one or two more real-world examples would add more value; if in doubt, I'd err towards seeking a couple more. In particular, it's very helpful from a software design point of view for me to understand exactly how these things map to the real world (where by "real world" I mean "the world of the analyst / statistician").
An additional issue that may need resolving is having more flexibility for "as of" dates. Currently we can ask, for example, "what's the patient's care home status as of …"
You'd want this for both one-row-per-patient datasets and one-row-per-event datasets. It would probably need a table of "state transition" dates (care home status, dialysis status, employment status, whatever) to implement.
@wjchulme you are sent from above. So much clearer than I could have ever described.
Programmatically, the "Many datasets" option makes most sense to me.
How I'd imagine it is being able to create a single "one-row per patient" dataset that contains all fixed variables (ethnicity, gender, index date, age at index date, comorbidities at index date, etc.), essentially what we currently have the ability to do.
But for some studies (honestly, most studies using EHR data), I would have a number of additional tables that I would link to in order to create longitudinal exposures/covariates. For example, HbA1c (a laboratory measure). Let's say I want to adjust for HbA1c at baseline and through follow-up, allowing for changes over time.
Baseline is easy (if we just wanted most recent measure prior to a fixed index date, say 1 March 2020). Study definitions can currently do this by:
```python
hba1c_pct_baseline=patients.with_these_clinical_events(
    hba1c_old_codes,
    find_last_match_in_period=True,
    on_or_before="2020-02-29",
    returning="numeric_value",
    include_date_of_match=True,
),
```
But for follow-up, which for this example say it's 1 March 2020 to 31 October 2020, I'd want to create a separate "multiple rows per patient" dataset that stores every date and results of HbA1c recorded during the study period. Something like...
```python
hba1c_pct_fup=patients.with_these_clinical_events(
    hba1c_old_codes,
    find_all_matches_in_period=True,  # proposed new argument
    between=["2020-03-01", "2020-10-31"],  # proposed new argument
    returning="numeric_value",
    include_date_of_match=True,
),
```
And this would look like Will's blood pressure dataset above.
The follow-up question would be: can we use dynamic dates, as Will suggests in the message prior? Very important for analyses that start at, say, positive test date, medication receipt date, hospitalisation date, etc.
If we get the capability to extract multiple datasets of longitudinal information, a workaround for dynamic dates would be to pull all values in the full study period, say 1 March 2020 to 31 October 2020, and then identify the closest value of something prior to the dynamic index date (e.g., hospitalisation) that would have occurred sometime in that period.
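That workaround ("closest value prior to a per-patient index date") maps neatly onto pandas `merge_asof`; a hedged sketch with made-up HbA1c values and index dates:

```python
import pandas as pd

# All HbA1c results extracted over the full study period (illustrative values).
hba1c = pd.DataFrame({
    "ptid": [1, 1, 2],
    "date": pd.to_datetime(["2020-04-01", "2020-08-15", "2020-05-10"]),
    "hba1c_pct": [6.8, 7.1, 8.2],
})

# One dynamic index date per patient, e.g. hospitalisation date.
index_dates = pd.DataFrame({
    "ptid": [1, 2],
    "index_date": pd.to_datetime(["2020-09-01", "2020-06-01"]),
})

# merge_asof picks, per patient, the latest measurement on or before the
# index date; both frames must be sorted on their date keys first.
baseline = pd.merge_asof(
    index_dates.sort_values("index_date"),
    hba1c.sort_values("date"),
    left_on="index_date",
    right_on="date",
    by="ptid",
    direction="backward",
)
print(baseline)
```

Patient 1 (index 1 September) gets the 15 August reading; patient 2 (index 1 June) gets the 10 May reading. The same pattern works for any "value as of event date" lookup once the event-level extract exists.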
VERY happy to discuss this on a call. So far we have been doing excellent things with static variables. But the full power of EHR data really comes from the longitudinality of the data. :) :)
From @sebbacon in this Slack thread:
what language features might make [event-level data queries via study definitions] easier to implement our policies [that relate to data disclosure / access minimisation]?
Some suggestions:
Currently, `study_definition.py` only accommodates one-row-per-patient datasets. It would be good to be able to construct one-row-per-patient-per-event study datasets, to facilitate the use of epi methods that use repeated observations on the same individuals. This almost certainly means time-indexed repeat observations (eg for time-varying treatments, recurrent events, etc). In theory you could still output all the relevant data in one-row-per-patient form, for example like this:

`obs1date, obs1value1, obs1value2, obs1value3, obs2date, obs2value1, obs2value2, obs2value3, obs3date, ...`

but I think this would be inefficient, since you'd have to do some faffy munging to get the data into that format from the level 2 data, only to then have to undo it all for the analysis (since software for longitudinal methods typically needs a many-rows-per-patient dataset).
I'm assuming this won't "just work" within the current `StudyDefinition` function, because a) you will define a variable that returns multiple values per patient, b) it also needs to return the time index, and c) the time index will need to be merged across different time-varying variables. But if it somehow does work, then the rest of this post can be ignored!

One solution would be to create two datasets in `study_definition.py`: a one-row-per-patient dataset for all variables that are assumed not to be time-varying (sex, ethnicity, ...), as happens currently with `StudyDefinition`; and a many-rows-per-patient dataset that contains all the time-indexed info using a new function, say `StudyDefinitionMulti` (but not that). The datasets can be joined on patient ids afterwards. This would mean there's no need to tinker with `StudyDefinition`. It's also useful from a research design perspective, as it forces the study author to think about how to structure the data for whatever statistical models are being run. For example, do I need `pt_id="1", date_of_death="2020-01-01"` in the one-row-per-patient dataset, or `pt_id="1", date="2020-01-01", event_type="death"` in the many-rows-per-patient dataset?

An alternative would be to allow the time-indexed variables to be returned all within one cell for each patient, so that each cell contains an indexed vector of repeated observations and not just a single value. These can then be unnested for the many-rows-per-patient dataset. But I've no idea how well Python handles vectors/arrays nested within dataframe cells.
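On the nested-vector idea: pandas can hold plain Python lists in object-dtype columns and unnest them with `DataFrame.explode` (multi-column explode needs pandas ≥ 1.3), though object columns are not memory-efficient. A minimal sketch:

```python
import pandas as pd

# One row per patient, with repeated observations nested in list cells.
nested = pd.DataFrame({
    "ptid": [1, 2],
    "date": [["2020-01-01", "2020-01-05"],
             ["2020-01-04", "2020-01-07", "2020-01-15"]],
    "sbp": [[110, 120], [160, 170, 180]],
})

# Unnest both list columns in lockstep into a many-rows-per-patient frame;
# the lists in each row must have matching lengths across columns.
long = nested.explode(["date", "sbp"], ignore_index=True)
print(long)
```

So the unnesting step is straightforward; the open question is whether the extraction backend could sensibly produce and serialise such nested cells in the first place.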
Things to consider:

- How would this work with `expectations`? Does it need to? Does it complicate dummy data generation?