mmcdermott / MEDS_transforms

A simple set of MEDS polars-based ETL and transformation functions
MIT License

Add transformation for injecting time-interval codes based on config specification #120

Open prenc opened 1 month ago

prenc commented 1 month ago

This is how I do it in ETHOS:

Specification:

time_intervals_spec:
  5m-15m:
    minutes: 5
  15m-1h:
    minutes: 15
  1h-2h:
    hours: 1
  2h-6h:
    hours: 2
#...
  1mt-3mt:
    weeks: 4
  3mt-6mt:
    weeks: 12
  =6mt:
    weeks: 24

The above is a list of breaks that specifies how to inject the time interval tokens. Everything below the first break (5 minutes) is omitted, i.e., no time interval will be added. The 5m-15m interval goes into all gaps between events that are in the [5min, 15min) range, and so on, up to the last one =6mt, which is added as many times as it fits in the time gap (time_gap // 24weeks).
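
A minimal sketch of the injection rule described above, in plain Python rather than the polars-based code used in MEDS_transforms. The shortened `TOY_SPEC`, its final `=2h` label, and the function names are hypothetical and chosen only for illustration; this is not the ETHOS implementation, and it assumes the breaks are listed in ascending order.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical, shortened spec: label -> lower break of that interval.
# The last entry plays the role of "=6mt" and is repeated to fill long gaps.
TOY_SPEC = {
    "5m-15m": {"minutes": 5},
    "15m-1h": {"minutes": 15},
    "1h-2h": {"hours": 1},
    "=2h": {"hours": 2},
}


def interval_tokens(gap, spec=TOY_SPEC):
    """Return the time-interval tokens to inject for one gap between events."""
    labels = list(spec)
    breaks = [timedelta(**kwargs) for kwargs in spec.values()]
    idx = bisect_right(breaks, gap) - 1  # index of the last break <= gap
    if idx < 0:
        return []  # below the first break: nothing is injected
    if idx == len(breaks) - 1:
        # Final interval: add it as many times as it fits in the gap.
        return [labels[idx]] * (gap // breaks[idx])
    return [labels[idx]]


def inject(events, spec=TOY_SPEC):
    """events: time-sorted list of (timestamp, code); returns the code sequence."""
    out = []
    for i, (time, code) in enumerate(events):
        if i > 0:
            out.extend(interval_tokens(time - events[i - 1][0], spec))
        out.append(code)
    return out
```

With this toy spec, a 10-minute gap yields one `5m-15m` token, a 30-second gap yields nothing, and a 5-hour gap yields two `=2h` tokens (5h // 2h = 2).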

Rough example: [image]

mmcdermott commented 1 month ago

@prenc, I bet we can integrate this into the https://github.com/mmcdermott/MEDS_transforms/blob/main/src/MEDS_transforms/transforms/add_time_derived_measurements.py setup.

mmcdermott commented 1 month ago

E.g., if you add a functor like https://github.com/mmcdermott/MEDS_transforms/blob/main/src/MEDS_transforms/transforms/add_time_derived_measurements.py#L270 to take a config and return a function that spits out only the time interval measurements, the rest of the setup should handle merging it back in and folding it into the overall transformation for us.

mmcdermott commented 3 weeks ago

@prenc so am I correct in thinking that if there are 100 events in a row all 30 seconds apart, then there would be no time interval tokens added to those events and the model wouldn't have any way to tell how much time had passed between the first event and the last, only that it was somewhere between 0 seconds and 500 minutes? I'm not suggesting that is a problem, just making sure I understand.

prenc commented 3 weeks ago

> @prenc so am I correct in thinking that if there are 100 events in a row all 30 seconds apart, then there would be no time interval tokens added to those events and the model wouldn't have any way to tell how much time had passed between the first event and the last, only that it was somewhere between 0 seconds and 500 minutes? I'm not suggesting that is a problem, just making sure I understand.

@mmcdermott Yes, that's correct. But why 500 minutes? Generally, it depends on the granularity of the events in the dataset, which is why we don't use ICU readings in the current implementation. At some point, I would like to automate this by determining the best time-interval spec based on a histogram of time gaps, but it requires more research.

mmcdermott commented 3 weeks ago

> @mmcdermott Yes, that's correct. But why 500 minutes? Generally, it depends on the granularity of the events in the dataset, which is why we don't use ICU readings in the current implementation. At some point, I would like to automate this by determining the best time-interval spec based on a histogram of time gaps, but it requires more research.

I got 500 minutes from 5 minutes * 100 events in a row, where 5 minutes was the minimum interval in your example config. I'm not suggesting that such an event configuration is remotely reasonable, or that your sample config would be used for a dataset where it occurred; I was just constructing a scenario to make sure I understood how the tokens would be injected.
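
The arithmetic behind that bound can be checked directly. This sketch assumes the 5-minute minimum break from the example config; the variable names are illustrative only.

```python
from datetime import timedelta

MIN_BREAK = timedelta(minutes=5)  # first break in the example spec
n_events = 100
gap = timedelta(seconds=30)       # spacing between consecutive events

# Count the gaps that are large enough to receive a time-interval token.
gaps_with_token = sum(1 for _ in range(n_events - 1) if gap >= MIN_BREAK)

actual_elapsed = (n_events - 1) * gap     # true first-to-last time
loose_bound = n_events * MIN_BREAK        # the "500 minutes" figure
tight_bound = (n_events - 1) * MIN_BREAK  # 99 gaps, each under 5 minutes
```

No gap reaches the first break, so no tokens are injected. The true elapsed time is 49.5 minutes, while the sequence alone only implies that it is under 495 minutes (500 as a round figure).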

mmcdermott commented 3 weeks ago

Also, regarding the configuration language, @justin13601 was there a library that you used for parsing time intervals in ACES, or did you write something by hand for that?

ChaoPang commented 3 weeks ago

@mmcdermott CEHR-BERT uses a set of coarse time tokens, as we didn't focus much on ICU visits. Here is the overall idea of the CEHR-BERT patient representation, in which we insert inter/intra-visit time tokens. The figure is taken from the CEHR-GPT paper; please ignore the demographics prompt, because we did not include it in CEHR-BERT.

[Figure: patient_representation]

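For inter-visit time tokens, we use the following logic to insert time tokens between visits:

  • For anything less than a month (4 weeks), we map it to a week token by taking the floor, e.g. 1 day -> [W0], 10 days -> [W1]
  • For anything between a month and a year, we map it to a month token by taking the floor, e.g. 32 days -> [M1]
  • For anything beyond 1 year, we map everything to a special [LT] (long-term) token

For intra-visit time tokens, we insert the time tokens between groups of events that occur on the same day; the time token construction logic is the same as for the inter-visit time tokens.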

For CEHR-GPT, we switched everything to day tokens for inter/intra-visit time intervals of less than 1095 days and used a [LT] (long-term) token for anything longer than 1095 days. We did this because the CEHR-GPT paper focused on synthetic data generation, and we wanted to preserve the patient timeline at the day level. However, we do not handle granular time intervals (e.g. 1 hour) between events within an ICU visit, which is something we would like to incorporate in the future.
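
The two bucketing schemes described in this thread can be sketched as follows. The function names, the 7-day week, the 30-day month, the 365-day year, the handling of exactly 1095 days, and the `[D<n>]` spelling of the CEHR-GPT day token are all assumptions made for illustration; the thread does not specify them, and this is not the CEHR-BERT or CEHR-GPT source code.

```python
def cehr_bert_time_token(days: int) -> str:
    """Week tokens under 4 weeks, month tokens under a year, [LT] beyond."""
    if days < 28:
        return f"[W{days // 7}]"
    if days < 365:
        # Assumes a 30-day month, so 28 and 29 days map to [M0] here.
        return f"[M{days // 30}]"
    return "[LT]"


def cehr_gpt_time_token(days: int) -> str:
    """Day tokens under 1095 days, [LT] otherwise (day-token spelling assumed)."""
    if days < 1095:
        return f"[D{days}]"
    return "[LT]"
```

Under these assumptions the thread's own examples are reproduced: 1 day maps to `[W0]`, 10 days to `[W1]`, and 32 days to `[M1]`.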

I agree with @prenc; it would be better to figure out the optimal set of time tokens based on the data.

mmcdermott commented 3 weeks ago

> For inter-visit time tokens, we use the following logic to insert time tokens between visits:
>
>   • For anything less than a month (4 weeks), we map it to a week token by taking the floor, e.g. 1 day -> [W0], 10 days -> [W1]
>   • For anything between a month and a year, we map it to a month token by taking the floor, e.g. 32 days -> [M1]
>   • For anything beyond 1 year, we map everything to a special [LT] (long-term) token
>
> For intra-visit time tokens, we insert the time tokens between groups of events that occur on the same day; the time token construction logic is the same as for the inter-visit time tokens.

This is super helpful, thanks @ChaoPang -- a few questions:

  1. When you say "we insert the time tokens between groups of events that occur on the same day" -- is this calendar day? So, if I have, all within a visit, a sequence [LAB1 @ 1/1 10am] [LAB2 @ 1/1 4pm] [DX1 @ 1/2 9am] [LAB4 @ 1/2 5pm] [DX5 @ 1/3 6pm] would that become [LAB1 @ 1/1 10am] [LAB2 @ 1/1 4pm] [W0] [DX1 @ 1/2 9am] [LAB4 @ 1/2 5pm] [W0] [DX5 @ 1/3 6pm] or something else?
  2. The critical difference between the two styles to me here seems like it is that you have different window sizes for "inter-visit tokens" vs. for "intra-visit tokens"; is that right @ChaoPang?
  3. Are the tokens themselves different for inter-visit vs. intra-visit (presuming it is the same size time gap), or would they map to different vocabulary elements? E.g., if I had a 3-day time gap within a visit and a 3-day time gap between visits, would those both be the same time interval token, or would one be [IN_VISIT 3D] and the other be [BETWEEN_VISITS 3D]?
  4. Am I correct in thinking that, like in ETHOS, you add the appropriate token between any two measurements with distinct times? But, in your case, you explicitly have a time token that maps to "these measurements aren't at the same time, but the gap is smaller than any other time token" (your [W0] token in the example above), whereas in ETHOS, per my conversation with @prenc above, there is no token for that interval and instead they would only include things from [W1] and beyond?

ChaoPang commented 3 weeks ago

@mmcdermott Thanks for posting these very good questions.

> 1. When you say "we insert the time tokens between groups of events that occur on the same day" -- is this calendar day? So, if I have, all within a visit, a sequence [LAB1 @ 1/1 10am] [LAB2 @ 1/1 4pm] [DX1 @ 1/2 9am] [LAB4 @ 1/2 5pm] [DX5 @ 1/3 6pm] would that become [LAB1 @ 1/1 10am] [LAB2 @ 1/1 4pm] [W0] [DX1 @ 1/2 9am] [LAB4 @ 1/2 5pm] [W0] [DX5 @ 1/3 6pm] or something else?

This is exactly right; the time token construction is based on the date of the event.

> 2. The critical difference between the two styles to me here seems like it is that you have different window sizes for "inter-visit tokens" vs. for "intra-visit tokens"; is that right @ChaoPang?

Did you mean the time interval distributions would be different between inter and intra-visit intervals? If so, you are absolutely right there. Ideally, we should use a different set of time tokens within inpatient visits, but for simplicity, I just used the same granularity for now.

> 3. Are the tokens themselves different for inter-visit vs. intra-visit (presuming it is the same size time gap), or would they map to different vocabulary elements? E.g., if I had a 3-day time gap within a visit and a 3-day time gap between visits, would those both be the same time interval token, or would one be [IN_VISIT 3D] and the other be [BETWEEN_VISITS 3D]?

Yes, I used different sets of tokens because their semantics are different, e.g. W1 vs. i-W1, where "i" denotes inpatient.

> 4. Am I correct in thinking that, like in ETHOS, you add the appropriate token between any two measurements with distinct times? But, in your case, you explicitly have a time token that maps to "these measurements aren't at the same time, but the gap is smaller than any other time token" (your [W0] token in the example above), whereas in ETHOS, per my conversation with @prenc above, there is no token for that interval and instead they would only include things from [W1] and beyond?

That's almost correct. For inpatient visits, I would only create a time token between two measurements if they occur on two different dates. If the events occur on the same day, there would be no time tokens created, just like the toy example you created above. Did I misunderstand your question?
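
A rough sketch of the intra-visit rule confirmed here: a token is inserted only when the calendar date changes between consecutive events, and nothing is inserted between events on the same date. The function name, the event tuples, the bracketed token spelling, and the restriction to week tokens are assumptions; the `i-` prefix follows the `i-W1` example above.

```python
from datetime import datetime


def insert_intra_visit_tokens(events, prefix="i-"):
    """events: time-sorted (timestamp, code) pairs from a single visit."""
    out = []
    prev_date = None
    for timestamp, code in events:
        current = timestamp.date()
        # Only a change of calendar date triggers a time token.
        if prev_date is not None and current != prev_date:
            weeks = (current - prev_date).days // 7
            out.append(f"[{prefix}W{weeks}]")  # token spelling is assumed
        out.append(code)
        prev_date = current
    return out


# The toy sequence from the question above (the year is arbitrary).
toy_visit = [
    (datetime(2024, 1, 1, 10), "LAB1"),
    (datetime(2024, 1, 1, 16), "LAB2"),
    (datetime(2024, 1, 2, 9), "DX1"),
    (datetime(2024, 1, 2, 17), "LAB4"),
    (datetime(2024, 1, 3, 18), "DX5"),
]
toy_sequence = insert_intra_visit_tokens(toy_visit)
```

Here `toy_sequence` places one week-0 token between the 1/1 and 1/2 groups and another between the 1/2 and 1/3 groups, matching the pattern in the question.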

justin13601 commented 3 weeks ago

> Also, regarding the configuration language, @justin13601 was there a library that you used for parsing time intervals in ACES, or did you write something by hand for that?

There was a library, pytimeparse. It is very easy to use and accepts a pretty flexible range of strings.

mmcdermott commented 3 weeks ago

@ChaoPang -- thank you, you answered all my questions perfectly. Three last questions (ok probably not last but three current new questions):

  1. What do you do with measurements that occur outside the bounds of a visit? E.g., in MIMIC, we will have hospitalizations, defined by hadm_id, but outside of those there are ED events, death events, the date of birth event, and maybe others. More generally, if you have a measurement you can't localize to any visit, how do you deal with that, both in terms of the measurement itself and any time interval tokens associated with it and the surrounding other measurements (in or not in visits)?
  2. Do you ever have cause to have multiple levels of "visit", or multiple kinds of "visits" that have different time interval token buckets? E.g., within a hospital admission, an ICU admission may have very different time intervals in general.
  3. If we just require the user to specify a proxy for visit_id that must be defined for your models to work, what properties must that visit_id hold? E.g., must it be the case that every measurement between that visit's start and end events has that visit_id recorded? I'm wondering about the best way to support your use case, as dealing with hierarchical visits may be challenging to do correctly in a simple format here.

ChaoPang commented 3 weeks ago

@mmcdermott Again, all good questions; you identified the limitations of CEHR-BERT and CEHR-GPT.

> 1. What do you do with measurements that occur outside the bounds of a visit? E.g., in MIMIC, we will have hospitalizations, defined by hadm_id, but outside of those there are ED events, death events, the date of birth event, and maybe others. More generally, if you have a measurement you can't localize to any visit, how do you deal with that, both in terms of the measurement itself and any time interval tokens associated with it and the surrounding other measurements (in or not in visits)?

As I mentioned earlier, CEHR-BERT/CEHR-GPT was developed primarily for EHRs that have relatively complete patient histories. If measurements occur outside the bounds of a visit, we currently do not take them into account. With the Columbia data, we tried to connect the "orphan" records to visits; this is specific to the Columbia data, as we had separate outpatient and inpatient EHR systems before we switched over to Epic in 2020.

For CEHR-GPT, we included special tokens that do not need to be part of a visit: [year_of_first_visit], [age_of_first_visit], [gender], and [race] are placed at the beginning of the patient sequence, and the [death] token is placed at the end of the sequence, after the last visit. For CEHR-BERT, we don't include those special tokens, since CEHR-BERT is only used to generate the patient representation for downstream prediction tasks.

I have been thinking about this problem as well. I know this could be a bigger issue with data like MIMIC; a temporary workaround would be to create artificial visits that group "orphan" records. We stick to the predefined template [VS]...[VE] [time_token] [VS]...[VE] because it makes it easier to convert the sequence back to OMOP data for synthetic data generation. In addition, the presence of [VS]/[VE] seems to help with downstream predictions in CEHR-BERT, and I hypothesize that the model can infer the number of visits in a patient history from these pairs. However, I should probably expand this pattern in the future so that it allows "orphan" records.
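
The template described above can be sketched as a small sequence builder. The function name, the input layout, the week-only time token, and measuring the gap from the previous visit's end to the next visit's start are hypothetical choices; only the `[VS]`/`[VE]` markers, the time token between visits, and the placement of the demographic and `[death]` tokens come from the comment.

```python
from datetime import date


def build_sequence(visits, prompt_tokens=(), died=False):
    """visits: time-sorted list of (start_date, end_date, [codes])."""
    seq = list(prompt_tokens)  # e.g. CEHR-GPT demographic tokens go first
    prev_end = None
    for start, end, codes in visits:
        if prev_end is not None:
            gap_days = (start - prev_end).days  # gap definition is assumed
            seq.append(f"[W{gap_days // 7}]")   # simplified: week tokens only
        seq += ["[VS]", *codes, "[VE]"]
        prev_end = end
    if died:
        seq.append("[death]")  # placed after the last visit
    return seq


example_visits = [
    (date(2020, 1, 1), date(2020, 1, 3), ["DX1", "LAB1"]),
    (date(2020, 1, 13), date(2020, 1, 13), ["DX2"]),
]
example_sequence = build_sequence(example_visits, prompt_tokens=["[gender]"], died=True)
```

Because every visit is wrapped in a `[VS]`/`[VE]` pair, the number of visits can be read off the sequence by counting either marker, which is consistent with the hypothesis above.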

> 2. Do you ever have cause to have multiple levels of "visit", or multiple kinds of "visits" that have different time interval token buckets? E.g., within a hospital admission, an ICU admission may have very different time intervals in general.

Not for now, but I think it's important to use time tokens at different granularities. I am currently experimenting with hour tokens for CEHR-GPT (inspired by ETHOS, of course).

> 3. If we just require the user to specify a proxy for visit_id that must be defined for your models to work, what properties must that visit_id hold? E.g., must it be the case that every measurement between that visit's start and end events has that visit_id recorded? I'm wondering about the best way to support your use case, as dealing with hierarchical visits may be challenging to do correctly in a simple format here.

I am not sure I want to force users to define what a visit is; I was hoping this could be done post-MEDS ETL. Now that I am typing this, I think I know what you mean by proxy: as long as we can identify the start and the end of a visit, we should be able to construct the visit. For instance, if records occur on the same day as some visit, we could connect those records to that visit; if there are multiple candidate visits, we could connect them to the one that is "closest" to the record.
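
One way to read that heuristic, as a sketch only: attach a record to a visit whose date span covers the record's date, and when several visits qualify, pick the nearest one. The function name, the data layout, and the definition of "closest" as the distance to the visit start are assumptions not stated in the thread.

```python
from datetime import datetime


def link_orphan(record_time, visits):
    """visits: list of (visit_id, start, end) datetimes; returns a visit_id or None."""
    day = record_time.date()
    candidates = [
        (abs(record_time - start), visit_id)
        for visit_id, start, end in visits
        if start.date() <= day <= end.date()  # record falls on a day of the visit
    ]
    if not candidates:
        return None  # no visit on that day: the record stays an orphan
    return min(candidates)[1]  # smallest distance to a visit start wins


example_visits = [
    ("v1", datetime(2020, 1, 1, 8), datetime(2020, 1, 1, 12)),
    ("v2", datetime(2020, 1, 1, 20), datetime(2020, 1, 3, 9)),
]
linked = link_orphan(datetime(2020, 1, 1, 18), example_visits)
unlinked = link_orphan(datetime(2020, 2, 1, 0), example_visits)
```

In this example the 6pm record shares a date with both visits and is linked to `v2`, whose start is 2 hours away rather than 10; the February record matches no visit and returns `None`.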

mmcdermott commented 3 weeks ago

@ChaoPang thanks -- very helpful. One clarifying question -- when you say you don't take orphaned measurements into account, do you mean you drop them entirely from your model? Or that you just don't generate artificial time tokens for them?

ChaoPang commented 3 weeks ago

@mmcdermott these orphan records that can't be linked to a visit get dropped entirely for now.

mmcdermott commented 2 weeks ago

This is some starter code that @Oufattole put together for his use, fyi: https://gist.github.com/Oufattole/ab73852d1719e1db13280c9191da1518