wjchulme opened 4 years ago
Just to make this real for me, can you work through an example? I don't quite understand what sort of things `obs1date, obs1value1, obs1value2, obs1value3, obs2date, obs2value1, obs2value2, obs2value3` would be.
We can either structure the data like this:
ptid | date | value |
---|---|---|
1 | 2020-01-01 | 1 |
1 | 2020-01-05 | 2 |
2 | 2020-01-04 | 5 |
2 | 2020-01-07 | 3 |
2 | 2020-01-15 | 4 |
Or like this:
ptid | date1 | value1 | date2 | value2 | date3 | value3 |
---|---|---|---|---|---|---|
1 | 2020-01-01 | 1 | 2020-01-05 | 2 | | |
2 | 2020-01-04 | 5 | 2020-01-07 | 3 | 2020-01-15 | 4 |
The former is much easier to work with programmatically, and is the format typically expected by stats routines in R, Stata, etc.
Neither option is possible with study definitions at the mo. At least, not for returning date-value pairs for any given event that can occur more than once (though see Measures for a solution to a related problem).
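To make the difference concrete, here is a hedged pandas sketch (not part of any study definition; the column names are taken from the tables above) of recovering the long format from the wide one:

```python
import pandas as pd

# Wide layout: one row per patient, repeated date/value column pairs.
wide = pd.DataFrame({
    "ptid": [1, 2],
    "date1": ["2020-01-01", "2020-01-04"],
    "value1": [1, 5],
    "date2": ["2020-01-05", "2020-01-07"],
    "value2": [2, 3],
    "date3": [None, "2020-01-15"],
    "value3": [None, 4],
})

# Reshape to one row per patient per event, dropping the padding cells
# that exist only because patients have different event counts.
long = pd.wide_to_long(wide, stubnames=["date", "value"], i="ptid", j="obs")
long = long.dropna(subset=["date"]).reset_index().sort_values(["ptid", "obs"])
print(long)
```

The long frame has 5 rows (2 events for patient 1, 3 for patient 2), which is exactly the first table above.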
Sorry, I meant a real-life example of what `value` might be. Just so I can cement this in my head!
Could be anything. Common physiological investigations like cholesterol, blood pressure, creatinine, liver function tests, weight. Or the reason for a hospital admission. Or the result of a screening test for aortic aneurysm.
So as a straw man what would it look like in a study definition? Blood pressure on the first of every month between date A and date B? Or the mean blood pressure during each quarter between date A and date B? Both? Other things?
Typically we'd want the actual measurement and the date it was taken, rather than average measurements or values at regular time points (which is why, in the example above, I used irregular event dates and a different number of events for different patients). So it's basically a subset of the event-level data.
I'm not sure how the study definition would look, as I think it would depend on whether we want to separate time-varying data and time-invariant data into two or more separate datasets or not (see OP for details). It also depends if you want a separate dataset for each time-based variable or not, and this really depends on the use-case. Easier to go through some options:
```python
StudyDefinition(
    population = patients.all(),
    sex = patients.sex(),
    sbp = events.with_these_codes(
        sbp_codelist,
        returning = "value",
        date_suffix = "_date",
        value_suffix = "_value",
    ),
    dbp = events.with_these_codes(
        dbp_codelist,
        returning = "value",
        date_suffix = "_date",
        value_suffix = "_value",
    ),
    admission = events.hosp_admission(
        returning = "reason_for_admission",
        date_suffix = "_date",
        value_suffix = "_reason",
    ),
)
```
ptid | sex | sbp_date1 | sbp_value1 | sbp_date2 | sbp_value2 | sbp_date3 | sbp_value3 | dbp_date1 | dbp_value1 | dbp_date2 | dbp_value2 | dbp_date3 | dbp_value3 | admission_date1 | admission_reason1 | admission_date2 | admission_reason2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | F | 2020-01-01 | 110 | 2020-01-05 | 120 | | | 2020-01-01 | 70 | 2020-01-05 | 80 | | | 2019-06-24 | "fall" | 2019-12-25 | "stroke" |
2 | M | 2020-01-04 | 160 | 2020-01-07 | 170 | 2020-01-15 | 180 | 2020-01-04 | 80 | 2020-01-07 | 80 | 2020-01-15 | 85 | 2018-10-20 | "MI" | | |
Here the `events` library contains a set of extractor functions that return multiple columns, one new column for each additional event.
The problem with this is that it doesn't store data "tidily" or efficiently, and will inevitably require a mammoth re-structuring for use in R or Stata.
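To give a flavour of that re-structuring, here is a hedged pandas sketch (column names from the table above; sbp columns only, and the same dance would be repeated for dbp and admissions) using melt plus a pattern split:

```python
import pandas as pd

# A slice of the hypothetical wide extract: sbp columns only.
wide = pd.DataFrame({
    "ptid": [1, 2],
    "sbp_date1": ["2020-01-01", "2020-01-04"],
    "sbp_value1": [110.0, 160.0],
    "sbp_date2": ["2020-01-05", "2020-01-07"],
    "sbp_value2": [120.0, 170.0],
    "sbp_date3": [None, "2020-01-15"],
    "sbp_value3": [None, 180.0],
})

# Melt everything, split "sbp_value2" into (variable="sbp_value", obs="2"),
# then pivot the variable part back out into proper columns.
melted = wide.melt(id_vars="ptid")
melted[["variable", "obs"]] = melted["variable"].str.extract(r"(\D+)(\d+)")
tidy = (
    melted.pivot(index=["ptid", "obs"], columns="variable", values="value")
    .dropna(subset=["sbp_date"])
    .reset_index()
)
print(tidy)
```

Doable, but it's boilerplate the analyst has to write and then effectively throw away, since the long shape is what the statistical software wanted in the first place.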
A one-row-per-patient dataset:
```python
study = StudyDefinition(
    population = patients.all(),
    sex = patients.sex(),
)
```
ptid | sex |
---|---|
1 | F |
2 | M |
And a one-row-per-event dataset:
```python
study_events = StudyDefinitionEvents(
    population = events.from_these_patients(study),
    sbp = events.with_these_codes(
        sbp_codelist,
        returning = "value",
    ),
    dbp = events.with_these_codes(
        dbp_codelist,
        returning = "value",
    ),
    hosp_admission = events.hosp_admission(
        returning = "reason_for_admission",
    ),
)
```
ptid | date | event | value |
---|---|---|---|
1 | 2020-01-01 | "sbp" | 110 |
1 | 2020-01-05 | "sbp" | 120 |
1 | 2020-01-01 | "dbp" | 70 |
1 | 2020-01-05 | "dbp" | 80 |
1 | 2019-06-24 | "hosp_admission" | "fall" |
1 | 2019-12-25 | "hosp_admission" | "stroke" |
2 | 2020-01-04 | "sbp" | 160 |
2 | 2020-01-07 | "sbp" | 170 |
2 | 2020-01-15 | "sbp" | 180 |
2 | 2020-01-04 | "dbp" | 80 |
2 | 2020-01-07 | "dbp" | 80 |
2 | 2020-01-15 | "dbp" | 85 |
2 | 2018-10-20 | "hosp_admission" | "MI" |
Here `population = events.from_these_patients(study)` says we only want events from patients who are included in the `study` dataset defined earlier.
The problem with this approach is that you have mixed types in the `value` column, so it doesn't store data very efficiently, and it may be difficult to recover the true types when the data are split up for analysis.
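The type-recovery problem can be sketched in pandas (event names and values taken from the table above): numeric readings survive the round-trip through a mixed object column only after an explicit cast.

```python
import pandas as pd

# Mixed-type long format: numeric readings and string admission reasons
# share a single object-dtype "value" column.
events = pd.DataFrame({
    "ptid": [1, 1, 1, 2, 2],
    "date": ["2020-01-01", "2020-01-05", "2019-06-24", "2020-01-04", "2018-10-20"],
    "event": ["sbp", "sbp", "hosp_admission", "sbp", "hosp_admission"],
    "value": [110, 120, "fall", 160, "MI"],  # dtype: object
})

# Split by event type and restore the true dtype of each slice.
sbp = events[events["event"] == "sbp"].copy()
sbp["value"] = pd.to_numeric(sbp["value"])  # back to a numeric column

admissions = events[events["event"] == "hosp_admission"].copy()
# "value" here is genuinely a string column; no cast needed.
print(sbp.dtypes["value"], admissions.dtypes["value"])
```

Each analyst has to remember (or be told) which event types are numeric, dates, or categorical, which is exactly the fragility described above.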
A one-row-per-patient dataset:
```python
study = StudyDefinition(
    population = patients.all(),
    sex = patients.sex(),
)
```
ptid | sex |
---|---|
1 | F |
2 | M |
A dataset for blood pressure:
```python
measurement_events = StudyDefinitionEvents(
    population = events.from_these_patients(study),
    sbp = events.with_these_codes(
        sbp_codelist,
        returning = "value",
    ),
    dbp = events.with_these_codes(
        dbp_codelist,
        returning = "value",
    ),
)
```
ptid | date | sbp | dbp |
---|---|---|---|
1 | 2020-01-01 | 110 | 70 |
1 | 2020-01-05 | 120 | 80 |
2 | 2020-01-04 | 160 | 80 |
2 | 2020-01-07 | 170 | 80 |
2 | 2020-01-15 | 180 | 85 |
which would essentially `full_join` all SBP measurements and all DBP measurements by `ptid` and `date` (and assumes no more than one measurement per day)
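That full join corresponds to an outer merge in pandas; a minimal sketch using the measurements above:

```python
import pandas as pd

sbp = pd.DataFrame({
    "ptid": [1, 1, 2, 2, 2],
    "date": ["2020-01-01", "2020-01-05", "2020-01-04", "2020-01-07", "2020-01-15"],
    "sbp": [110, 120, 160, 170, 180],
})
dbp = pd.DataFrame({
    "ptid": [1, 1, 2, 2, 2],
    "date": ["2020-01-01", "2020-01-05", "2020-01-04", "2020-01-07", "2020-01-15"],
    "dbp": [70, 80, 80, 80, 85],
})

# Outer merge keeps a row whenever either measurement exists on a date;
# this relies on at most one measurement per patient per day, otherwise
# the join key (ptid, date) is no longer unique and rows multiply.
bp = sbp.merge(dbp, on=["ptid", "date"], how="outer").sort_values(["ptid", "date"])
print(bp)
```

If a patient had only an SBP reading on some date, the `dbp` cell would simply be missing for that row, which is usually the desired behaviour.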
And a dataset for hosp admissions:
```python
admission_events = StudyDefinitionEvents(
    population = events.from_these_patients(study),
    reason = events.hosp_admission(
        returning = "reason_for_admission",
    ),
)
```
ptid | date | admission_reason |
---|---|---|
1 | 2019-06-24 | "fall" |
1 | 2019-12-25 | "stroke" |
2 | 2018-10-20 | "MI" |
which could also return e.g. length of stay, discharge location, etc., all linked to the same admission event date. This is my preferred option, as it's flexible enough to handle lots of different situations.
These are just some incomplete ideas. Would it be helpful if I collected a few examples of research and/or methods that rely on data structured in this way?
These examples are super-helpful. I think your final suggestion makes the most sense. I think you're best placed to know if one or two more real-world examples would add more value; if in doubt, I'd err towards seeking a couple more. In particular, it's very helpful from a software design point of view for me to understand exactly how these things map to the real world (where by "real world" I mean "the world of the analyst / statistician").
An additional issue that may need resolving is having more flexibility for "as of" dates. Currently we can ask, for example, "what's the patient's care home status as of …"
You'd want this for both one-row-per-patient datasets and one-row-per-event datasets. It would probably need a table of "state transition" dates (care home status, dialysis status, employment status, whatever) to implement.
@wjchulme you are sent from above. So much clearer than I could have ever described.
Programmatically, the "Many datasets" option makes most sense to me.
How I'd imagine it is being able to create a single "one-row per patient" dataset that contains all fixed variables (ethnicity, gender, index date, age at index date, comorbidities at index date, etc.), essentially what we currently have the ability to do.
But for some studies (honestly, most studies using EHR data), I would have a number of additional tables that I would link to in order to create longitudinal exposures/covariates. For example, HbA1c (a laboratory measure). Let's say I want to adjust for HbA1c at baseline and through follow-up, allowing for changes over time.
Baseline is easy (if we just wanted most recent measure prior to a fixed index date, say 1 March 2020). Study definitions can currently do this by:
```python
hba1c_pct_baseline=patients.with_these_clinical_events(
    hba1c_old_codes,
    find_last_match_in_period=True,
    on_or_before="2020-02-29",
    returning="numeric_value",
    include_date_of_match=True,
),
```
But for follow-up, which for this example say it's 1 March 2020 to 31 October 2020, I'd want to create a separate "multiple rows per patient" dataset that stores every date and results of HbA1c recorded during the study period. Something like...
```python
hba1c_pct_fup=patients.with_these_clinical_events(
    hba1c_old_codes,
    find_all_matches_in_period=True,  # proposed new argument
    between=["2020-03-01", "2020-10-31"],  # proposed new argument
    returning="numeric_value",
    include_date_of_match=True,
),
```
And this would look like Will's blood pressure dataset above.
The follow-up question would be: can we use dynamic dates, as Will suggests in the message prior? Very important for analyses that start at, say, positive test date, medication receipt date, hospitalisation date, etc.
If we get the capability to extract multiple datasets of longitudinal information, a workaround for dynamic dates would be to pull all values in the full study period, say 1 March 2020 to 31 October 2020, and then identify the closest value of something prior to the dynamic index date (e.g., hospitalisation) that would have occurred sometime in that period.
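That workaround ("closest value prior to a per-patient index date") maps neatly onto pandas `merge_asof`; a hedged sketch with made-up HbA1c values and index dates:

```python
import pandas as pd

# All HbA1c results extracted over the full study period (illustrative values).
hba1c = pd.DataFrame({
    "ptid": [1, 1, 2],
    "date": pd.to_datetime(["2020-04-01", "2020-08-15", "2020-05-10"]),
    "hba1c_pct": [6.8, 7.1, 8.2],
})

# One dynamic index date per patient, e.g. hospitalisation date.
index_dates = pd.DataFrame({
    "ptid": [1, 2],
    "index_date": pd.to_datetime(["2020-09-01", "2020-06-01"]),
})

# merge_asof picks, per patient, the latest measurement on or before the
# index date; both frames must be sorted on their date keys first.
baseline = pd.merge_asof(
    index_dates.sort_values("index_date"),
    hba1c.sort_values("date"),
    left_on="index_date",
    right_on="date",
    by="ptid",
    direction="backward",
)
print(baseline)
```

Patient 1 (index 1 September) gets the 15 August reading; patient 2 (index 1 June) gets the 10 May reading. The same pattern works for any "value as of event date" lookup once the event-level extract exists.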
VERY happy to discuss this on a call. So far we have been doing excellent things with static variables. But the full power of EHR data really comes from the longitudinality of the data. :) :)
From @sebbacon in this Slack thread:
what language features might make [event-level data queries via study definitions] easier to implement our policies [that relate to data disclosure / access minimisation]?
Some suggestions:
Currently, `study_definition.py` only accommodates one-row-per-patient datasets. It would be good to be able to construct one-row-per-patient-per-event study datasets, to facilitate the use of epi methods that use repeated observations on the same individuals. This almost certainly means time-indexed repeat observations (eg for time-varying treatments, recurrent events, etc). In theory you could still output all the relevant data in one-row-per-patient form, for example like this:

`obs1date, obs1value1, obs1value2, obs1value3, obs2date, obs2value1, obs2value2, obs2value3, obs3date, ...`

but I think this would be inefficient, since you'd have to do some faffy munging to get the data into that format from the level 2 data, only to then have to undo it all for the analysis (since software for longitudinal methods typically needs a many-rows-per-patient dataset).
I'm assuming this won't "just work" within the current `StudyDefinition` function, because a) you will define a variable that returns multiple values per patient, b) it also needs to return the time index, and c) the time index will need to be merged across different time-varying variables. But if it somehow does work, then the rest of this post can be ignored!

One solution would be to create two datasets in `study_definition.py`: a one-row-per-patient dataset for all variables that are assumed not to be time-varying (sex, ethnicity, ...), as happens currently with `StudyDefinition`; and a many-rows-per-patient dataset that contains all the time-indexed info using a new function, say `StudyDefinitionMulti` (but not that). The datasets can be joined on patient ids afterwards. This would mean there's no need to tinker with `StudyDefinition`. It's also useful from a research design perspective, as it forces the study author to think about how to structure the data for whatever statistical models are being run. For example, do I need `pt_id="1", date_of_death="2020-01-01"` in the one-row-per-patient dataset, or `pt_id="1", date="2020-01-01", event_type="death"` in the many-rows-per-patient dataset?

An alternative would be to allow the time-indexed variables to be returned all within one cell for each patient, so that each cell contains an indexed vector of repeated observations and not just a single value. These can then be unnested for the many-rows-per-patient dataset. But I've no idea how well Python handles vectors/arrays nested within dataframe cells.
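On the nested-vector idea: pandas can hold plain Python lists in object-dtype columns and unnest them with `DataFrame.explode` (multi-column explode needs pandas ≥ 1.3), though object columns are not memory-efficient. A minimal sketch:

```python
import pandas as pd

# One row per patient, with repeated observations nested in list cells.
nested = pd.DataFrame({
    "ptid": [1, 2],
    "date": [["2020-01-01", "2020-01-05"],
             ["2020-01-04", "2020-01-07", "2020-01-15"]],
    "sbp": [[110, 120], [160, 170, 180]],
})

# Unnest both list columns in lockstep into a many-rows-per-patient frame;
# the lists in each row must have matching lengths across columns.
long = nested.explode(["date", "sbp"], ignore_index=True)
print(long)
```

So the unnesting step is straightforward; the open question is whether the extraction backend could sensibly produce and serialise such nested cells in the first place.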
Things to consider:

- How would this work with `expectations`? Does it need to? Does it complicate dummy data generation?