Initial commit for Mass Conditional RNN

skinniderlab / CLM

MIT License


Open adichaloo opened 1 month ago

vineetbansal commented 2 weeks ago

Just noting down a couple of points we discussed with @skinnider so they're not lost:

- The descriptors used for sampling can come from the held-out set, for all the sampling runs done for that training fold. (This will be a tweak in the workflow and will likely not drastically change anything in this PR.)
- create_training_sets should copy over any non-SMILES, non-InChIKey columns and save them in the augmented dataset (i.e. the descriptors for a SMILES don't change when it's augmented).
- preprocess should just pass any non-SMILES fields on to downstream steps (i.e. assume that the raw dataset has SMILES and optional descriptors, and we don't try to generate them ourselves).
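
A rough sketch of the create_training_sets point, to make the intent concrete. It is not the code in this PR: augment_rows, the enumerate_smiles callback, and the column names "smiles" and "inchikey" are all assumptions for illustration, and rows are assumed to arrive as plain dicts (e.g. from csv.DictReader). InChIKey handling is left to whatever the existing code already does.

```python
from typing import Callable, Dict, Iterable, List

# Columns treated as identifiers rather than descriptors (assumed names).
NON_DESCRIPTOR_COLUMNS = {"smiles", "inchikey"}


def augment_rows(
    rows: Iterable[Dict[str, str]],
    enumerate_smiles: Callable[[str], List[str]],
) -> List[Dict[str, str]]:
    """Expand each row into one row per augmented SMILES, keeping descriptors.

    enumerate_smiles is a hypothetical stand-in for the real augmentation
    step; it takes one SMILES string and returns its augmented forms.
    """
    augmented = []
    for row in rows:
        # Any column that is not a SMILES or InChIKey is carried over unchanged.
        descriptors = {
            key: value
            for key, value in row.items()
            if key not in NON_DESCRIPTOR_COLUMNS
        }
        for smiles in enumerate_smiles(row["smiles"]):
            # Every augmented form of a molecule shares the same descriptors.
            augmented.append({"smiles": smiles, **descriptors})
    return augmented
```

The same "everything that is not an identifier column is a descriptor" rule would let preprocess forward optional descriptor columns without knowing their names in advance.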

vineetbansal commented 1 week ago

Notes on the current implementation:

- sample_descriptor_file has the same format as the input to train_models_RNN.
- sample_descriptors.csv has been introduced with dummy values, but is not used in any of the tests.

train_models_RNN
- conditional_RNN = False -> no change in behavior
- conditional_RNN = True:
  - csv - descriptors are read from the csv file; if none are found, it falls back to its own 6 hardcoded descriptors.