Closed anushka255 closed 1 month ago
These are great improvements! Can you add a test parallel to `create_training_sets` that drives this point home by comparing `train.inchikeys` and `train0.inchikeys` and using an enum_factor > 1?
Added a test function `test_unique_inchikeys` to verify that the augmented and un-augmented training sets have the same unique inchikeys.
Great! I removed the assert outside the fold loop since that's superfluous once the inner asserts pass.
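For readers following the thread, here is a rough sketch of the kind of per-fold check discussed above. It does not use the real `create_training_sets` API: the row layout, the helper names `unique_inchikeys` and `check_folds`, and the placeholder keys are all assumptions made for illustration only.

```python
# Hypothetical stand-in data: each row pairs a SMILES string with an InChIKey.
# The keys below are placeholders, not real InChIKeys.

def unique_inchikeys(rows):
    """Return the set of distinct InChIKeys in a list of (smiles, inchikey) rows."""
    return {inchikey for _smiles, inchikey in rows}


def check_folds(folds):
    """Assert, inside the fold loop, that augmentation added no new molecules."""
    for train0, train in folds:
        assert unique_inchikeys(train0) == unique_inchikeys(train)


# Un-augmented training set: one SMILES per molecule.
train0 = [("CCO", "KEY-ETHANOL"), ("c1ccccc1", "KEY-BENZENE")]

# Augmented training set (as with an enum_factor > 1): several SMILES
# per molecule, but the same underlying molecules.
train = [
    ("CCO", "KEY-ETHANOL"),
    ("OCC", "KEY-ETHANOL"),
    ("c1ccccc1", "KEY-BENZENE"),
    ("C1=CC=CC=C1", "KEY-BENZENE"),
]

folds = [(train0, train)]
check_folds(folds)
```

Because the comparison happens per fold, a separate assert after the loop would only repeat what the inner asserts already established.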
Keeping track of the frequency of each invalid and known SMILES as we iterate, instead of calling `list.count()` at the end. I ran this step for the first time after incorporating `invalid` and `known` SMILES, and it took almost a day to complete. The change in this PR sped up the process by orders of magnitude.

Also updated the input train file for `tabulate_molecules` to be the un-augmented train file (`train0`). Since we're only comparing inchikeys at this step, and the augmented and un-augmented datasets have the same unique inchikeys, using the un-augmented train file here will be much faster.
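To illustrate the frequency-counting change, here is a minimal sketch that is not the code from this PR. The function names and the sample SMILES list are hypothetical. Calling `list.count()` once per distinct item rescans the whole list each time, whereas a `collections.Counter` updated during iteration makes a single pass.

```python
from collections import Counter


def count_with_list_count(smiles_stream):
    """Slow pattern: collect everything, then rescan the list per distinct SMILES."""
    seen = list(smiles_stream)
    return {s: seen.count(s) for s in set(seen)}


def count_incrementally(smiles_stream):
    """Faster pattern: update a running tally while iterating, in one pass."""
    freqs = Counter()
    for s in smiles_stream:
        freqs[s] += 1
    return freqs


# Hypothetical sampled strings standing in for invalid/known SMILES.
sampled = ["CCO", "C1CC", "CCO", "c1ccccc1", "C1CC", "C1CC"]

# Both approaches give the same tallies; only the cost differs.
assert dict(count_incrementally(sampled)) == count_with_list_count(sampled)
```

With n sampled strings and k distinct values, the first approach does roughly n × k comparisons and the second roughly n dictionary updates, which is where the large speed-up on big samples comes from.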