skinniderlab / CLM

MIT License

incorporating train dataset from individual tc_files to overall tc file #209

Closed anushka255 closed 2 weeks ago

anushka255 commented 1 month ago

PR #150 aggregated the train results from the individual CV folds and incorporated them into the *_all_{metric}_CV_ranks_structure.csv files, but missed doing the same for the *_all_{metric}_CV_tc.csv files. We failed to notice this because we weren't using the tc.csv files until now.

The test files also don't reveal the missing SMILES for source=train, because we haven't updated them in a while.

A particular run of the pipeline on della seems to have train SMILES incorporated in *min{min_freq}_all_{metric}_CV_tc.csv.gz.

/scratch/gpfs/vineetb/clm/out/ped_backup_04062024/ to be exact.

This baffles me because min_freq (PR #172) was merged after the CV fold aggregation (PR #150). Also, checking out the earlier branch that introduced min_freq doesn't include train SMILES in the all_tc file.
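
For comparing outputs across runs and branches, a quick check along these lines could confirm whether a given aggregated tc file contains any train rows. This is only a sketch: it assumes the file has a column literally named `source` whose value is `train` for those rows, and `count_sources` is a hypothetical helper, not something in the CLM codebase.

```python
import csv
import gzip
from collections import Counter


def count_sources(path, column="source"):
    """Count rows per value of `column` in a plain or gzipped CSV file.

    The column name "source" is an assumption about the tc file layout.
    """
    # Pick the opener from the file extension so .csv and .csv.gz both work.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"no {column!r} column in {path}")
        return Counter(row[column] for row in reader)
```

Calling `count_sources(path).get("train", 0)` on the all_tc file from the della run and on one produced by the current branch should show whether train SMILES are present in one and missing from the other.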

vineetbansal commented 3 weeks ago

Note to self: Introduce/modify test case to exercise code path that utilizes this flag, after bug is fixed.