som-shahlab / ehr_ml

Code for doing machine learning with various EHRs
MIT License

Switch to a continuous time representation #5

Closed EthanSteinberg closed 1 year ago

EthanSteinberg commented 3 years ago

One issue with ehr_ml is that it is fundamentally based on a day-granularity timeline. One improvement would be to enable a continuous-time setup where both data and predictions can occur in the middle of a day.

Here are the things that would need to get changed in order for that to happen:

  1. The extractor would need to be upgraded to extract intraday information.
  2. The data format (and readers/API) would need to support intraday information.
  3. The label definition would need to be changed to support intraday information.
  4. CLMBR and various other featurizers would need to be upgraded to support partial day featurization.
  5. CLMBR would need to be upgraded to support predicting future events in the current day.
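
To illustrate what the day-granularity limitation means in practice, here is a minimal sketch. The `Event` class and the `day_offset` / `minute_offset` helpers are hypothetical names for illustration only, not part of the ehr_ml API. It shows two events on the same day collapsing to one index under day granularity, while a finer offset keeps them apart.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: a code plus a full timestamp (not just a date)."""
    code: str
    time: datetime


def day_offset(event: Event, birth: datetime) -> int:
    """Day-granularity index: whole calendar days since birth."""
    return (event.time.date() - birth.date()).days


def minute_offset(event: Event, birth: datetime) -> int:
    """Finer-grained index: whole minutes since birth."""
    return int((event.time - birth).total_seconds() // 60)


birth = datetime(2000, 1, 1, 0, 0)
morning = Event("G43.909", datetime(2000, 1, 11, 8, 30))
evening = Event("T39.1X1", datetime(2000, 1, 11, 20, 15))

# Same day index, so a day-granularity timeline cannot order these two events.
same_day = day_offset(morning, birth) == day_offset(evening, birth)
# Distinct minute offsets preserve the intraday ordering.
ordered = minute_offset(morning, birth) < minute_offset(evening, birth)
```

Minutes are used here only as an example resolution; the same idea applies to any unit finer than a day.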

jason-fries commented 3 years ago

Is there any reason we couldn't move to a very generic timeline representation, more akin to what's used in BERT? Assume a fixed vocabulary V defines our underlying language, plus some number of special placeholder tokens denoting time bucket boundaries, basically the <SEP> and <CLS> analogues from BERT. For example, a visit with two time buckets could look like this: [<CLS>, G43.909, Z79.899, <SEP>, T39.1X1, <SEP>]

Here the encoder would make some assumptions about the ordering (or lack thereof) within time buckets, but this is analogous to how we treat sequence data in NLP anyway, where we may or may not have some tree structure defined over a sentence. <CLS> would correspond to a representation of the entire timeline (again, like BERT), and time bucket embeddings are just the content of each <SEP> window.
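
As a rough sketch of the flattening described above, the snippet below builds the token sequence from a list of time buckets and recovers the buckets again. The function names `encode_timeline` and `split_buckets` are made up for illustration and are not part of ehr_ml or any BERT library.

```python
from typing import List, Sequence

CLS = "<CLS>"  # placeholder for the whole-timeline representation
SEP = "<SEP>"  # placeholder marking the end of one time bucket


def encode_timeline(buckets: Sequence[Sequence[str]]) -> List[str]:
    """Flatten time buckets into one token list: <CLS>, then each bucket followed by <SEP>."""
    tokens = [CLS]
    for bucket in buckets:
        tokens.extend(bucket)
        tokens.append(SEP)
    return tokens


def split_buckets(tokens: Sequence[str]) -> List[List[str]]:
    """Invert encode_timeline: recover the per-bucket code lists."""
    buckets: List[List[str]] = []
    current: List[str] = []
    for token in tokens[1:]:  # skip the leading <CLS>
        if token == SEP:
            buckets.append(current)
            current = []
        else:
            current.append(token)
    return buckets


visit = [["G43.909", "Z79.899"], ["T39.1X1"]]
tokens = encode_timeline(visit)
recovered = split_buckets(tokens)
```

Note that this encoding records only which bucket a code falls in; it carries no information about when each bucket starts or how long it lasts.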

I'd also love to hear thoughts from @spfohl @scottfleming on representation ideas.

spfohl commented 3 years ago

To respond to the points @jason-fries brings up, I think the BERT-like representation of the sequence could work, but only if coupled with metadata that provides information about the timestamps associated with each time interval. Another issue is that we want to be able to support arbitrary discretization and irregular gaps between intervals. One option would be to have something like this available as a result of a processing system that operates on a more generic timeline, but I still think global or local metadata will be necessary.
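
One way to picture this, as a sketch only: a processing step takes a generic timeline of timestamped events plus arbitrary bucket boundaries, and emits buckets that carry their own start and end timestamps as local metadata. The `Bucket` class and `discretize` function are hypothetical and assume half-open intervals `[start, end)`; nothing here reflects an existing ehr_ml interface.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence, Tuple


@dataclass
class Bucket:
    """Hypothetical time bucket with its own timestamp metadata."""
    start: datetime  # inclusive
    end: datetime    # exclusive
    codes: List[str]


def discretize(
    events: Sequence[Tuple[datetime, str]],
    boundaries: Sequence[datetime],
) -> List[Bucket]:
    """Assign events to buckets [boundaries[i], boundaries[i + 1]).

    Assumes boundaries are sorted and distinct; they may be irregularly spaced.
    Events outside the overall range are dropped.
    """
    buckets = [Bucket(s, e, []) for s, e in zip(boundaries, boundaries[1:])]
    for time, code in sorted(events):
        index = bisect_right(boundaries, time) - 1
        if 0 <= index < len(buckets):
            buckets[index].codes.append(code)
    return buckets


# Irregular boundaries: a 4-hour bucket followed by a 60-hour bucket.
boundaries = [
    datetime(2000, 1, 11, 8, 0),
    datetime(2000, 1, 11, 12, 0),
    datetime(2000, 1, 14, 0, 0),
]
events = [
    (datetime(2000, 1, 12, 20, 15), "T39.1X1"),
    (datetime(2000, 1, 11, 8, 30), "G43.909"),
    (datetime(2000, 1, 11, 9, 0), "Z79.899"),
]
buckets = discretize(events, boundaries)
widths_hours = [(b.end - b.start).total_seconds() / 3600 for b in buckets]
```

Because each bucket keeps its `start` and `end`, a downstream encoder could consume both the codes and the interval widths, which a bare token sequence would not expose.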