tttianhao / CLEAN

CLEAN: a contrastive learning model for high-quality functional prediction of proteins
MIT License
217 stars 41 forks

The cross-validation process details. #35

Closed jixiang-all closed 11 months ago

jixiang-all commented 11 months ago

Hi,

Thanks for contributing the code! I was wondering about the details of your cross-validation process.

I performed a standard K-fold split on split10.csv for cross-validation, then replicated the training process and trained a CLEAN model with triplet loss on the training set (within CV). The final F1 score on the validation set (within CV) is around 0.5.

Therefore, I am curious how you obtained the results shown in Figure S1. Did you use a plain random cross-validation split, or some trick (e.g., ensuring that every EC number in the CV validation set has at least one sample in the CV training set)?
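For concreteness, the "trick" asked about here could look like the sketch below: a K-fold split that holds one sample of every EC number out of all validation folds, so each validation EC is guaranteed a training representative. This is only a guess at a possible protocol, not the CLEAN authors' confirmed procedure; `ec_aware_kfold` and its inputs are hypothetical names.

```python
import random
from collections import defaultdict

def ec_aware_kfold(samples, k=5, seed=0):
    """Split (seq_id, ec_number) pairs into k validation folds such that
    every EC number appearing in a fold also has at least one sample in
    the corresponding training set. ECs with a single sample never enter
    a validation fold. Hypothetical sketch, not the authors' protocol."""
    rng = random.Random(seed)
    by_ec = defaultdict(list)
    for sample in samples:
        by_ec[sample[1]].append(sample)
    folds = [[] for _ in range(k)]
    anchors = []  # one sample per EC that always stays in training
    for ec, members in by_ec.items():
        rng.shuffle(members)
        anchors.append(members[0])
        # remaining samples of this EC are spread round-robin over folds
        for i, m in enumerate(members[1:]):
            folds[i % k].append(m)
    return folds, anchors

# Usage: for fold i, validation set = folds[i];
# training set = anchors + all samples from the other k-1 folds.
```

With this construction the ~0.5 F1 gap reported above could plausibly come from validation ECs that have no training support under a plain random split.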

Meanwhile, I am also curious how you split the "understudied validation dataset" shown in Figure S2. How did you enforce the "no more than 5 times" constraint? Are there samples used in neither training nor inference (to produce the results in Figure S2)?

Thanks in advance for your answers; I look forward to your response!

canallee commented 11 months ago

The question was answered through email communication.

goes0n commented 10 months ago

Hi, I'd also like to know how to split the datasets for five-fold cross-validation at a specific sequence similarity threshold (e.g., 50%). For this process, must we ensure that all EC numbers appear in each training set?
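One common recipe for similarity-controlled CV (not confirmed as what CLEAN does) is: first cluster sequences at the desired identity with an external tool such as MMseqs2 or CD-HIT (not reproduced here), then assign whole clusters to folds so no pair of sequences above the threshold straddles training and validation. A minimal sketch, where `cluster_group_kfold` is a hypothetical helper taking a precomputed seq-id-to-cluster-id mapping:

```python
from collections import defaultdict

def cluster_group_kfold(cluster_of, k=5):
    """Assign whole sequence clusters (e.g. from a 50%-identity
    clustering run) to k folds, greedily balancing fold sizes.
    Because clusters are kept intact, no two sequences above the
    identity threshold are split across training and validation.
    Note: this alone does NOT guarantee that every EC number
    appears in each training set; that would need an extra check."""
    clusters = defaultdict(list)
    for seq_id, cid in cluster_of.items():
        clusters[cid].append(seq_id)
    folds = [[] for _ in range(k)]
    # largest clusters first, each into the currently smallest fold
    for cid in sorted(clusters, key=lambda c: -len(clusters[c])):
        target = min(range(k), key=lambda i: len(folds[i]))
        folds[target].extend(clusters[cid])
    return folds
```

Fold i's validation set is `folds[i]` and its training set is the union of the other folds, exactly as in ordinary K-fold CV, just grouped by cluster.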

Thanks in advance for your answers; I look forward to your response!

elgeekim commented 8 months ago

I'm also curious. Could you tell me how the validation split is done for each dataset?