CLMBR notebook API clarity

Thanks for the notebooks! They're really helpful.

I don't have access to data in the OMOP CDM format yet and so my comments for now mostly relate to API clarity.

clmbr_create_info (notebook 2)
- What is the default min_patient_count?
- Can seed value be specified as part of the input arg to clmbr_create_info?
- Codes were pruned here, is there documentation for this process?
- Does clmbr operate on all available features in a starr omop dataset or a specific set of feature categories? A documentation link would be great.
clmbr_train_model (notebook 2)
- What are size (appears to be the size of the output embeddings but could be clearer) and no_tied_weights?
- If use_gru is not specified, what's the alternative? How to configure clmbr to use transformers?
- It seems like Day dropout is possible based on the training log but is not available as an input arg to clmbr_train_model.
notebook 3a
- Is the time horizon defined with respect to each available day in the patient's timeline?
- My understanding is that the labeler randomly selects a label for each patient. If so, what is the timeline over which clmbr featurizes the patient? Is it up to the reference date?
  - This appears to be the case based on notebook 3b's markdown above cell 10.
- Related to the above, for patients in the test set (cell 7), is it correct to assume that during pre-training clmbr might have accessed these patients' timeline beyond the reference date if the patients were not listed in the banned_patient_file? If so, it might be helpful to make users aware of this.
Notebook 3b
- Cell 10 - how are the patient ids different? Would convert_patient_data always work as long as the patient_ids are directly obtained from STARR OMOP?

Other questions:

Is there a set of best practices for using clmbr for representation learning? E.g., patients in the evaluation set should be listed in the banned_patient_file.

Wow, thanks for all the feedback. I'll respond inline here to your comments and update the documentation accordingly.

clmbr_create_info: The default is 10 patients. Custom seeds are not supported. That's a bug that I need to fix. Codes are pruned by removing all codes that appear in too few patients. The documentation here could be cleaned up. CLMBR operates on everything except notes.
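To make the pruning rule concrete, here is a minimal sketch of dropping codes that appear in too few patients. It is not the ehr_ml implementation: the function name `prune_rare_codes`, the `(patient_id, code)` input layout, and the choice to keep codes seen in at least `min_patient_count` distinct patients (rather than strictly more) are all assumptions made for illustration.

```python
from collections import defaultdict


def prune_rare_codes(events, min_patient_count):
    """Return the set of codes seen in at least `min_patient_count` distinct patients.

    `events` is an iterable of (patient_id, code) pairs (hypothetical layout).
    """
    patients_per_code = defaultdict(set)
    for patient_id, code in events:
        # A set ensures a patient with repeated occurrences is counted once.
        patients_per_code[code].add(patient_id)
    return {
        code
        for code, patients in patients_per_code.items()
        if len(patients) >= min_patient_count
    }


# Toy data: code "A" appears in three patients, code "B" in only one.
events = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (1, "B")]
kept_codes = prune_rare_codes(events, min_patient_count=2)
```

In this toy example only "A" survives, because "B" is recorded twice but for a single patient.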

clmbr_train_model:

The size is the embedding size for clmbr. For simplicity, many of the other sizes in the model are set to the same value. Docs here could be cleaned up. no_tied_weights disables an optimization that shares weights across the input and output embeddings. That should be documented better. If use_gru is set to false, a transformer is used instead. Day dropout is supported by the code, but shouldn't be used since it doesn't improve much. Let me strip it from the API.
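As a rough illustration of what tying weights means in general, the sketch below shares one embedding table between the input lookup and the output scoring layer. This is a generic weight-tying example in plain Python, not CLMBR's model code; `make_matrix`, `output_scores`, and the toy dimensions are hypothetical.

```python
import random


def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def output_scores(hidden, output_embeddings):
    """Score each code as the dot product of the hidden state with its output embedding."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_embeddings]


rng = random.Random(0)
vocab_size, size = 5, 4  # `size` plays the role of the embedding size
input_embeddings = make_matrix(vocab_size, size, rng)

# Tied weights: the output layer reuses the input table, so no extra parameters.
tied_output = input_embeddings

# Untied weights (the situation a flag like no_tied_weights would select in
# this sketch): the output layer gets its own, separately learned table.
untied_output = make_matrix(vocab_size, size, rng)

hidden = input_embeddings[2]  # stand-in for a model hidden state
tied_scores = output_scores(hidden, tied_output)
untied_scores = output_scores(hidden, untied_output)
```

Tying roughly halves the number of embedding parameters, which is why it is usually treated as an optimization that can be switched off.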

notebook 3a: Time horizon is with respect to the prediction time. The basic idea is how far ahead you should look for the event whenever you predict. For example, if you are making a prediction for cancer and the patient gets cancer in 20 years, that would be considered a negative label with a 1 year time horizon. I think an example might make this clearer in the docs. All featurization is done up until the reference date.
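A small sketch of the time-horizon rule described above, assuming a label is positive only when the event falls after the prediction date and no later than the prediction date plus the horizon. The function `label_within_horizon` and the exact boundary handling are illustrative assumptions, not the labeler's actual code.

```python
from datetime import date, timedelta


def label_within_horizon(prediction_date, event_date, horizon):
    """Return True if the event occurs within `horizon` after the prediction date.

    `event_date` may be None when the patient never has the event.
    """
    if event_date is None:
        return False
    return prediction_date < event_date <= prediction_date + horizon


one_year = timedelta(days=365)
prediction_date = date(2020, 1, 1)

# Event five months after prediction: inside the 1 year horizon.
label_soon = label_within_horizon(prediction_date, date(2020, 6, 1), one_year)

# Event 20 years after prediction: outside the 1 year horizon.
label_late = label_within_horizon(prediction_date, date(2040, 1, 1), one_year)

# Patient never has the event.
label_never = label_within_horizon(prediction_date, None, one_year)
```

Here `label_soon` is positive while `label_late` and `label_never` are negative, matching the cancer example: an event 20 years out does not count under a 1 year horizon.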

Yep, if the patient is not in the banned patient file you will have leakage.

Notebook 3b:

The main problem here is the difference between the raw patient ids and the processed ones. This should be documented more.

"Is there a set of best practices for using clmbr for representation learning? E.g., patients in the evaluation set should be listed in the banned_patient_file."

Yeah, this is probably the current biggest issue with our code. We need to improve the documentation here. In general, any patients you evaluate on should be in the banned patient file.
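One way to guard against this kind of leakage is to derive the banned list directly from the evaluation cohort and to check the pretraining cohort against it. The sketch below assumes the banned patient file is plain text with one patient ID per line; that layout is not confirmed in this thread, and the helper names `format_banned_patients` and `find_leaked_patients` are hypothetical.

```python
def format_banned_patients(evaluation_patient_ids):
    """Render evaluation patient IDs one per line (assumed file layout)."""
    unique_ids = sorted(set(evaluation_patient_ids))
    return "\n".join(str(patient_id) for patient_id in unique_ids) + "\n"


def find_leaked_patients(pretraining_patient_ids, evaluation_patient_ids):
    """Return patients that appear in both the pretraining and evaluation cohorts."""
    return set(pretraining_patient_ids) & set(evaluation_patient_ids)


evaluation_ids = [103, 101]
pretraining_ids = [100, 101, 102]

# Text that could be written to the banned patient file under the assumed layout.
banned_text = format_banned_patients(evaluation_ids)

# Any non-empty result means pretraining saw a patient it will later be evaluated on.
leaked = find_leaked_patients(pretraining_ids, evaluation_ids)
```

In this toy example patient 101 is flagged, since it appears in both cohorts.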

Updated responses with the most recent PR:

clmbr_create_info
- "What is the default min_patient_count."
  - The code was using 100 as the default, so I updated the help message to say so. @Lalaland, not sure if that's your intention though.
- "Can seed value be specified as part of the input arg to clmbr_create_info?"
  - Seed value can now be specified
- "Codes were pruned here, is there documentation for this process?"
  - Documentation has yet to be added, this is still an open issue
- "Does clmbr operate on all available features in a starr omop dataset or a specific set of feature categories? A documentation link would be great."
  - Documentation has yet to be added, this is still an open issue
clmbr_train_model (notebook 2)
- "What are size (appears to be the size of the output embeddings but could be clearer) and no_tied_weights?"
  - More descriptive help text has been added for --size
- "If use_gru is not specified, what's the alternative? How to configure clmbr to use transformers?"
  - More descriptive help text has been added to specify that a transformer is used if --use_gru is not specified
- "It seems like Day dropout is possible based on the training log but is not available as an input arg to clmbr_train_model."
  - I've added day_dropout as an option, but I'm not sure if it should be there. If left unspecified the default behavior should be the same as before.
The other notebooks and their documentation have not been updated as much. We're planning to update the notebooks to all use the same cohort so that we make proper use of the banned_patient_file.

som-shahlab / ehr_ml

CLMBR notebook API clarity #12