som-shahlab / ehr_ml

Code for doing machine learning with various EHRs
MIT License
21 stars 3 forks source link

CLMBR notebook API clarity #12

Closed guolin1 closed 2 years ago

guolin1 commented 3 years ago

Thanks for the notebooks! They're really helpful.

I don't have access to data in the OMOP CDM format yet and so my comments for now mostly relate to API clarity.

Other questions:

EthanSteinberg commented 3 years ago

Wow, that's for all the feedback. I'll respond inline here to your comments and update the documentation accordingly.

clmbr_create_info: The default is 10 patients. Custom seeds are not supported. That's a bug that I need to fix. Codes are pruned by removed all codes that have too few patients with that code. The documentation here could be cleaned up. CLMBR operates on everything except notes

clmbr_train_model:

The size is the embedding size for clmbr. For simplicity, many of the other sizes in the model are set to the same value. Docs here could be cleaned up. no_tied_weights disables an optimization that shares weights across the input and output embeddings. That should be documented better. If use_gru is set to false, a transformer is used instead. Day dropout is supported by the code, but shouldn't be used since it doesn't improve much. Let me strip it from the API.

notebook 3a Time horizon is with respect to the prediction time. The basic idea is how long should you loop ahead for the event whenever you predict. For example, if you are making a prediction for cancer and they get cancer in 20 years that would be considered a negative label with a 1 year time horizon. I think an example might make this more clear in the docs. All featurization is done up until the reference date.

 Yep, if it the patient is not in the banned patient file you will have leakage.

Notebook 3b:

The main problem here is the difference between the raw patient ids and the processed ones. This should be documented more. 

Is there a set of best practice for using clmbr for representation learning? E.g., patients in the evaluation set should be listed in the banned_patient_file. Yeah, this is probably the current biggest issue with our code. We need to improve the documentation here. In general, any patients you evaluate on should be in the banned patient file.

woffett commented 3 years ago

Updated responses with the most recent PR: