Develop deep learning-based models for learning from structured EHR data as sequences.
### TLSTM
- TLSTM was first published in "Patient Subtyping via Time-Aware LSTM Networks", KDD, 2017.
- We re-implemented TLSTM using PyTorch in TLSTM.
- The implementation was validated using the synthetic data released in the original TLSTM repo.
- On that data we obtained an AUC of 0.958 on the training set and 0.916 on the test set.
- Input data format:
```python
# We use a slightly different input format compared to the original implementation.
# Input has three components: 1. features (OHE or embeddings); 2. time intervals; 3. labels
# feature - 3D Tensor      - [Batch, Sequence, Data]
# time    - 3D Tensor      - [Batch, Sequence, time_diff]
# label   - 2D numpy array - [Batch, label]
```
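For concreteness, below is a minimal sketch of building one batch in this format. The tensor sizes, the one-hot feature layout, the use of days as the time unit, and the one-hot label encoding are assumptions for illustration, not the repo's actual preprocessing.

```python
import numpy as np
import torch

batch, seq_len, n_feat = 32, 10, 100  # illustrative sizes

# feature: one-hot encoded visits, [Batch, Sequence, Data]
features = torch.zeros(batch, seq_len, n_feat)
features[0, 0, 5] = 1.0  # e.g., patient 0 has code 5 at visit 0

# time: elapsed time between consecutive visits, [Batch, Sequence, time_diff]
time_diff = torch.rand(batch, seq_len, 1) * 90.0  # assumed unit: days

# label: per-patient labels, [Batch, label] (one-hot encoding assumed here)
labels = np.zeros((batch, 2), dtype=np.float32)
labels[np.arange(batch), np.random.randint(0, 2, batch)] = 1.0
```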
### LSTM
- This is a baseline for comparison purposes.
- Input data shape: features as a 3D tensor ```[Batch, Sequence, Data]``` and labels as in TLSTM; no time-interval tensor is needed.
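As a reference point, here is a minimal PyTorch sketch of the kind of plain-LSTM baseline this section describes; the class name and hyperparameters are illustrative, not the repo's actual code.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Plain LSTM over visit sequences; ignores time intervals."""
    def __init__(self, n_features, hidden_size=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):               # x: [Batch, Sequence, Data]
        _, (h_n, _) = self.lstm(x)      # h_n: [1, Batch, Hidden]
        return self.fc(h_n.squeeze(0))  # logits: [Batch, n_classes]
```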
### Embedding
- We only support pre-trained embeddings.
- We do not support in-situ random initialization of embeddings.
- If you want randomly initialized embeddings, you have to create the randomly initialized embedding matrix yourself.
- Embedding input shape: ```[Batch, Sequence, Features, Embeddings]```. Note that we have 4 dimensions instead of 3 because each feature has been converted to an embedding, which is a dense vector, whereas OHE directly uses 0/1 values to represent features.
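A sketch of producing that 4D tensor from a pre-trained embedding matrix; the file name, vocabulary size, and lookup pipeline here are assumptions for illustration, not this repo's API.

```python
import torch
import torch.nn as nn

# Hypothetical file; any [Vocab, Emb] float matrix works. To "randomly initialize"
# instead (see above), build the matrix yourself, e.g. torch.randn(vocab, emb_dim).
pretrained = torch.load("code_embeddings.pt")              # [Vocab, Emb]
embed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# codes: integer-coded features per visit, [Batch, Sequence, Features]
codes = torch.randint(0, pretrained.size(0), (32, 10, 20))
x = embed(codes)                                           # [Batch, Sequence, Features, Embeddings]
```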
### Self-attention (TODO)
- We plan to implement a self-attention architecture to replace LSTM and TLSTM.
- Self-attention has been shown to outperform LSTMs on many NLP tasks (e.g., seq2seq translation).
- Positional embeddings can be used to encode time variance, making self-attention a suitable replacement for TLSTM (see the sketch after this list).
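Since this part is still a TODO, the following is only a sketch of one possible approach: a sinusoidal positional encoding driven by cumulative elapsed time rather than token index, which could be added to visit embeddings before self-attention. The function name and dimensions are assumptions.

```python
import math
import torch

def time_positional_encoding(time_diff, d_model=64):
    """Sinusoidal encoding computed from elapsed time instead of token position.

    time_diff: [Batch, Sequence, 1] tensor of time gaps between visits.
    Returns:   [Batch, Sequence, d_model] encodings to add to visit embeddings.
    """
    t = time_diff.cumsum(dim=1)                          # cumulative time since first visit
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    freq = torch.exp(-math.log(10000.0) * i / d_model)   # [d_model/2] frequencies
    angles = t * freq                                    # broadcast to [Batch, Sequence, d_model/2]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```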
## MixStaticSeq model
- We developed a mixed model that can handle both static features (e.g., demographics) and time-series features (e.g., diagnoses, medications, procedures, labs).
- We use MLP + TLSTM (or LSTM) as the model architecture; an illustrative sketch follows this list.
- We have a specific data input format doc at:
- Current evaluation is measured with ROC-AUC, sensitivity, specificity, and accuracy.
- Input data format
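Below is a minimal sketch of the MLP + TLSTM (or LSTM) combination described above, using a plain `nn.LSTM` as a stand-in for TLSTM. The class name, layer sizes, and input shapes (static ```[Batch, Static]```, sequence ```[Batch, Sequence, Data]```) are assumptions, not this repo's actual code.

```python
import torch
import torch.nn as nn

class MixStaticSeq(nn.Module):
    """Illustrative sketch: MLP over static features + LSTM over sequences,
    with the two representations concatenated before classification."""
    def __init__(self, n_static, n_seq_features, hidden_size=128, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_static, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # A TLSTM cell could be dropped in here in place of nn.LSTM.
        self.lstm = nn.LSTM(n_seq_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(64 + hidden_size, n_classes)

    def forward(self, static_x, seq_x):
        s = self.mlp(static_x)            # [Batch, 64]
        _, (h_n, _) = self.lstm(seq_x)    # h_n: [1, Batch, Hidden]
        return self.fc(torch.cat([s, h_n.squeeze(0)], dim=-1))
```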