Is there any reason we couldn't move to a very generic timeline representation more akin to what's used in BERT?
For example, assume a fixed vocabulary V defines our underlying language, plus some number of special placeholder tokens denoting time bucket boundaries -- basically the `<SEP>` and `<CLS>` analogues from BERT. A visit with two time buckets could then look like this: `[<CLS>, G43.909, Z79.899, <SEP>, T39.1X1, <SEP>]`
Here the encoder would make some assumptions about the ordering (or lack thereof) within time buckets, but this is analogous to how we treat sequence data in NLP anyway, where we may or may not have some tree structure defined over a sentence. `<CLS>` would correspond to a representation of the entire timeline (again, like BERT), and time bucket embeddings are just the contents of each `<SEP>` window.
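To make the layout concrete, here's a minimal Python sketch of flattening time buckets into that token sequence (the `encode_timeline` helper and token constants are hypothetical, not part of ehr_ml):

```python
# Hypothetical sketch of the proposed flat token layout (not an ehr_ml API).
CLS, SEP = "<CLS>", "<SEP>"

def encode_timeline(time_buckets):
    """Flatten a list of time buckets (each a list of codes) into one
    BERT-style sequence: [<CLS>, codes..., <SEP>, codes..., <SEP>]."""
    tokens = [CLS]
    for bucket in time_buckets:
        tokens.extend(bucket)   # ordering within a bucket is unspecified
        tokens.append(SEP)      # <SEP> closes each time bucket
    return tokens

# The two-bucket example from above:
print(encode_timeline([["G43.909", "Z79.899"], ["T39.1X1"]]))
# ['<CLS>', 'G43.909', 'Z79.899', '<SEP>', 'T39.1X1', '<SEP>']
```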
I'd love to also hear thoughts from @spfohl and @scottfleming on representation ideas.
To respond to the points @jason-fries brings up, I think the BERT-like representation of the sequence could work, but only if coupled with metadata that provides the timestamps associated with each time interval. Another issue is that we want to be able to support arbitrary discretization and irregular gaps between intervals. One option would be to have something like this available as the output of a processing system that operates on a more generic timeline, but I still think global or local metadata will be necessary.
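As a rough illustration of what that interval metadata could look like (all names here are hypothetical, sketching one possible structure rather than an existing API):

```python
# Hypothetical sketch of attaching per-interval timestamp metadata so the
# flat token sequence can support arbitrary discretization and irregular
# gaps between intervals.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TimeBucket:
    start: datetime        # explicit boundaries, so gaps between
    end: datetime          # consecutive buckets can be irregular
    codes: List[str]

@dataclass
class Timeline:
    patient_id: int        # global metadata for the whole sequence
    buckets: List[TimeBucket]

timeline = Timeline(
    patient_id=123,
    buckets=[
        TimeBucket(datetime(2020, 1, 1), datetime(2020, 1, 2),
                   ["G43.909", "Z79.899"]),
        # a nine-day gap before the next bucket is perfectly legal
        TimeBucket(datetime(2020, 1, 11), datetime(2020, 1, 11),
                   ["T39.1X1"]),
    ],
)
```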
One issue with ehr_ml is that it is fundamentally based on a day-granularity timeline. One improvement would be to enable a more continuous setup where both data and predictions can occur in the middle of a day.
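A sketch of what that more continuous setup might look like, with events keyed by full datetimes instead of day indices so a prediction can be anchored mid-day (illustrative data and variable names):

```python
# Sketch of sub-day granularity: events and prediction times carry full
# datetimes rather than day indices.
from datetime import datetime

events = [
    (datetime(2020, 1, 1, 8, 30), "G43.909"),   # 8:30 AM diagnosis
    (datetime(2020, 1, 1, 14, 5), "Z79.899"),   # same day, 2:05 PM
]

# A prediction can now be anchored in the middle of a day: use only
# events strictly before the prediction time.
prediction_time = datetime(2020, 1, 1, 12, 0)
visible = [e for e in events if e[0] < prediction_time]
print(visible)  # only the 8:30 AM event
```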
Here are the things that would need to get changed in order for that to happen: