Closed guolin1 closed 2 years ago
Wow, that's for all the feedback. I'll respond inline here to your comments and update the documentation accordingly.
clmbr_create_info: The default is 10 patients. Custom seeds are not supported. That's a bug that I need to fix. Codes are pruned by removed all codes that have too few patients with that code. The documentation here could be cleaned up. CLMBR operates on everything except notes
clmbr_train_model:
The size is the embedding size for clmbr. For simplicity, many of the other sizes in the model are set to the same value. Docs here could be cleaned up. no_tied_weights disables an optimization that shares weights across the input and output embeddings. That should be documented better. If use_gru is set to false, a transformer is used instead. Day dropout is supported by the code, but shouldn't be used since it doesn't improve much. Let me strip it from the API.
notebook 3a Time horizon is with respect to the prediction time. The basic idea is how long should you loop ahead for the event whenever you predict. For example, if you are making a prediction for cancer and they get cancer in 20 years that would be considered a negative label with a 1 year time horizon. I think an example might make this more clear in the docs. All featurization is done up until the reference date.
Yep, if it the patient is not in the banned patient file you will have leakage.
Notebook 3b:
The main problem here is the difference between the raw patient ids and the processed ones. This should be documented more.
Is there a set of best practice for using clmbr for representation learning? E.g., patients in the evaluation set should be listed in the banned_patient_file. Yeah, this is probably the current biggest issue with our code. We need to improve the documentation here. In general, any patients you evaluate on should be in the banned patient file.
Updated responses with the most recent PR:
clmbr_create_info
clmbr_train_model
(notebook 2)
--size
--use_gru
is not specifiedOther notebooks have not been updated as much and documentation. We're planning to update the notebooks to all use the same cohort such that we make proper use of the banned_patient_file
.
Thanks for the notebooks! They're really helpful.
I don't have access to data in the OMOP CDM format yet and so my comments for now mostly relate to API clarity.
clmbr_create_info
(notebook 2)clmbr_train_model
(notebook 2)notebook 3a
Notebook 3b
Other questions: