Open adichaloo opened 1 month ago
Notes on the current implementation:
- `sample_descriptor_file` is the same format as the input to `train_models_RNN`.
- `sample_descriptors.csv` has been introduced with dummy values, but is not being used in any of the tests.
- `train_models_RNN`
  - `conditional_RNN = False` -> no change in behavior
  - `conditional_RNN = True`:
    - `csv` - descriptors from csv file; if none found, then it comes up with its own (6) hardcoded descriptors.
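For reference, the csv-or-fallback behavior described above could look roughly like the sketch below. This is not the code in the PR: `load_sample_descriptors` and `DEFAULT_DESCRIPTORS` are hypothetical names, the six zeros are placeholders for whatever values are actually hardcoded, and it assumes the csv has a header row and purely numeric columns.

```python
import csv
import os

# Placeholder fallback: six descriptor values. The real hardcoded values
# are not shown in this thread, so zeros stand in for them (assumption).
DEFAULT_DESCRIPTORS = [0.0] * 6


def load_sample_descriptors(path=None):
    """Return a list of descriptor rows, each a list of floats.

    Reads rows from a headered csv when `path` exists and has data rows;
    otherwise falls back to a single row of hardcoded defaults.
    """
    if path and os.path.exists(path):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row, if present
            rows = [[float(v) for v in row] for row in reader if row]
        if rows:
            return rows
    # No file, or a file with no data rows: use the fallback descriptors.
    return [list(DEFAULT_DESCRIPTORS)]
```

Calling `load_sample_descriptors(None)` or pointing it at a missing file returns the single fallback row, which mirrors the "if none found" branch.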
Just noting down a couple of points we discussed with @skinnider so they're not lost:
- The descriptors for sampling can come from the held-out set, for all the samplings done for that training fold. (This will be a tweak in the workflow and will likely not drastically change anything in this PR.)
- `create_training_sets` should copy over any non-smile, non-inchikey columns and save them in the augmented dataset (i.e. descriptors don't change for a smile when it's augmented).
- `preprocess` should just pass on any non-smile fields to downstream steps (i.e. assume that the raw dataset has smiles and optional descriptors, and we don't try to generate them manually).
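A rough sketch of the column-copying idea for `create_training_sets`, assuming rows are plain dicts and the columns are named `smiles` and `inchikey`. `augment_rows` and the `enumerate_smiles` callback are hypothetical stand-ins; the actual augmentation step and column names in the repo may differ.

```python
def augment_rows(rows, enumerate_smiles, smiles_col="smiles",
                 skip_cols=("inchikey",)):
    """Expand each row into one row per augmented smile.

    Every column that is neither the smiles column nor in `skip_cols`
    is treated as a descriptor and copied unchanged onto each variant.
    """
    augmented = []
    for row in rows:
        # Descriptor columns: everything except smiles and inchikey.
        extras = {
            key: value
            for key, value in row.items()
            if key != smiles_col and key not in skip_cols
        }
        # `enumerate_smiles` is a caller-supplied function returning the
        # augmented smiles strings for one input smile (hypothetical).
        for variant in enumerate_smiles(row[smiles_col]):
            augmented.append({smiles_col: variant, **extras})
    return augmented
```

With a toy enumerator such as `lambda s: [s, s[::-1]]`, a row `{"smiles": "CCO", "inchikey": "X", "logp": "0.1"}` yields two rows that share the same `logp` value, which is the "descriptors don't change" property noted above.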