Closed jixiang-all closed 11 months ago
The question was answered through email communication.
Hi, I'd also like to know how to split the datasets for five-fold cross-validation at a specific sequence similarity threshold (e.g. 50%). For this process, must we ensure that all EC numbers appear in each training set?
Thanks for your answers in advance and look forward to your response!
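For what it's worth, a common way to enforce an identity threshold like 50% is to cluster the sequences first (e.g. with MMseqs2) and then assign whole clusters to folds, so similar sequences never straddle a train/validation boundary. A minimal pure-Python sketch, assuming `cluster_ids` comes from such a clustering step (the clustering itself and the variable names here are illustrative, not from the CLEAN repo):

```python
import random
from collections import defaultdict

def group_kfold(cluster_ids, k=5, seed=0):
    """Assign each similarity cluster (not each sequence) to one fold,
    so no two sequences sharing >= the identity threshold end up on
    opposite sides of a train/validation split. cluster_ids[i] is the
    cluster label of sequence i (hypothetical input, e.g. from MMseqs2)."""
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    # round-robin whole clusters into k folds
    fold_of_cluster = {c: i % k for i, c in enumerate(clusters)}
    folds = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        folds[fold_of_cluster[c]].append(idx)
    return [folds[i] for i in range(k)]

# toy example: 8 sequences in 4 similarity clusters
cids = ["A", "A", "B", "B", "C", "C", "D", "D"]
folds = group_kfold(cids, k=2)
```

`sklearn.model_selection.GroupKFold` does the same job if scikit-learn is available; the point is only that the grouping key must be the similarity cluster, not the individual sequence.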
I'm also curious. Could you tell me how the validation split is done in each dataset?
Hi,
Thanks for contributing the code! I was wondering about the details of your cross-validation process.
I conducted a standard K-fold split on split10.csv for cross-validation. Then I replicated the training process and trained a CLEAN model with triplet loss on the training set (in CV). The final F1 score on the validation set (in CV) is around 0.5, so I am curious how you obtained the results shown in Figure S1. Did you use a standard random cross-validation split, or some trick (e.g. ensuring every EC number in the CV validation set has at least one sample in the CV training set)?
Meanwhile, I am also curious how you split the "understudied validation dataset" shown in Figure S2. How did you enforce the "no more than 5 times" constraint? Are there samples used neither in training nor in inference (to obtain the results in Figure S2)?
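One plausible reading of the "no more than 5 times" criterion is simply selecting EC numbers that occur at most five times in the data; a sketch of that reading (an assumption on my part, not the authors' code):

```python
from collections import Counter

def understudied_ecs(ec_labels, max_count=5):
    """Return the EC numbers occurring no more than max_count times --
    one possible interpretation of the understudied-set filter asked
    about above (assumed, not taken from the CLEAN repo)."""
    counts = Counter(ec_labels)
    return {ec for ec, n in counts.items() if n <= max_count}

# toy example: "1.1.1.1" occurs 6 times, the other ECs once each
labels = ["1.1.1.1"] * 6 + ["2.2.2.2", "3.3.3.3"]
rare = understudied_ecs(labels, max_count=5)
```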
Thanks for your answers in advance and look forward to your response!