PR #150 aggregated the results from the individual CV folds for `train` and incorporated them into the `*_all_{metric}_CV_ranks_structure.csv` files, but missed the `*_all_{metric}_CV_tc.csv` files. We failed to notice this because we weren't using the `tc.csv` files until now.
The test files also don't reflect the missing SMILES for `source=train` because we haven't updated them in a while.
A particular run of the pipeline on della seems to have train SMILES incorporated in `*min{min_freq}_all_{metric}_CV_tc.csv.gz`.
The run is in `/scratch/gpfs/vineetb/clm/out/ped_backup_04062024/`, to be exact.
This baffles me because `min_freq` (PR #172) was merged after the CV fold aggregation (PR #150). Also, checking out the earlier branch that introduced `min_freq` does not yield train SMILES in the `all_tc_file` either.
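
To compare outputs across branches and runs, a quick check along these lines might help. It is only a sketch: it assumes the tc files have a `source` column (inferred from `source=train` above), and the helper names `count_sources` and `has_train_rows` are made up for illustration rather than taken from the pipeline.

```python
import csv
import gzip
from collections import Counter


def count_sources(path, column="source"):
    """Count rows per value of `column` in a CSV file.

    Files whose name ends in .gz are read through gzip.
    The column name "source" is an assumption about the tc file layout.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {path}")
        return Counter(row[column] for row in reader)


def has_train_rows(path, column="source", label="train"):
    """Return True if at least one row has `column` equal to `label`."""
    return count_sources(path, column)[label] > 0
```

Running `has_train_rows(...)` on the tc file from the della run and on one produced from a fresh checkout should show whether the two really differ in their `source=train` rows.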