minor change to optimize tabulate molecules

anushka255 commented 1 month ago

This PR keeps track of the frequency of each invalid and known SMILES as we iterate, instead of calling list.count() at the end. I ran this step for the first time after incorporating invalid and known SMILES, and it took almost a day to complete. The change in this PR sped up the process by orders of magnitude.
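For readers who want to see the difference, here is a minimal sketch of the two counting strategies. It is not the code from this PR: the function names and the `sampled` list are hypothetical, and it assumes the slow version called `list.count()` once per unique SMILES.

```python
from collections import Counter


def count_with_list_count(smiles):
    # Slow pattern: every .count() call rescans the whole list, so the
    # total work grows with (number of unique SMILES) x (list length).
    return {s: smiles.count(s) for s in set(smiles)}


def count_while_iterating(smiles):
    # Fast pattern: a single pass with one dictionary update per item.
    freqs = Counter()
    for s in smiles:
        freqs[s] += 1
    return dict(freqs)


# Hypothetical sample of generated SMILES, for illustration only.
sampled = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "CCO", "c1ccccc1"]

# Both strategies give the same frequencies; only the cost differs.
assert count_with_list_count(sampled) == count_while_iterating(sampled)
```

With millions of sampled SMILES, the single-pass version avoids rescanning the list for every unique string, which is consistent with the speedup described above.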

Also, I updated the input train file for tabulate_molecules to be the un-augmented train file (train0). Since we're only comparing inchikeys at this step, and the augmented and un-augmented datasets have the same unique inchikeys, using the un-augmented train file here will be much faster.

vineetbansal commented 1 month ago

These are great improvements! Can you add a test, parallel to create_training_sets, that drives this point home by comparing train.inchikeys and train0.inchikeys with an enum_factor > 1?

anushka255 commented 1 month ago

Added a test function, test_unique_inchikeys, to verify that the augmented and un-augmented training sets have the same unique inchikeys.
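A rough sketch of what such a check can look like, for illustration only. This is not the actual test_unique_inchikeys from the repository: `augment`, `unique_inchikeys`, the `folds` data, and the placeholder keys such as `KEY-ETHANOL` are all hypothetical stand-ins, and the augmentation step is simulated by listing alternative SMILES for the same molecule rather than by calling the real enumeration code.

```python
def unique_inchikeys(rows):
    # Collect the distinct inchikeys in a list of {"smiles", "inchikey"} rows.
    return {row["inchikey"] for row in rows}


def augment(rows, variants):
    # Hypothetical stand-in for SMILES enumeration (enum_factor > 1):
    # each molecule is expanded into several equivalent SMILES strings
    # that all keep the inchikey of the original row.
    return [
        {"smiles": smi, "inchikey": row["inchikey"]}
        for row in rows
        for smi in variants[row["inchikey"]]
    ]


# Placeholder keys, not real InChIKeys; the SMILES in each list describe
# the same molecule written in different ways.
variants = {
    "KEY-ETHANOL": ["CCO", "OCC", "C(C)O"],
    "KEY-BENZENE": ["c1ccccc1", "C1=CC=CC=C1"],
    "KEY-ACETIC-ACID": ["CC(=O)O", "OC(C)=O"],
}

# Two hypothetical folds of un-augmented training data (train0).
folds = {
    0: [{"smiles": "CCO", "inchikey": "KEY-ETHANOL"},
        {"smiles": "c1ccccc1", "inchikey": "KEY-BENZENE"}],
    1: [{"smiles": "CC(=O)O", "inchikey": "KEY-ACETIC-ACID"}],
}


def test_unique_inchikeys_sketch():
    for fold, train0 in folds.items():
        train = augment(train0, variants)
        # Augmentation adds rows...
        assert len(train) > len(train0)
        # ...but does not change the set of unique inchikeys.
        assert unique_inchikeys(train) == unique_inchikeys(train0)


test_unique_inchikeys_sketch()
```

The per-fold assertions are the part that matters: if every fold passes, the augmented and un-augmented sets agree on their unique inchikeys, which is why train0 is enough for tabulate_molecules.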

vineetbansal commented 1 month ago

Great! I removed the assert outside the fold loop, since it's superfluous once the asserts inside the loop pass.

skinniderlab/CLM, pull request #207