tttianhao / CLEAN

CLEAN: a contrastive learning model for high-quality functional prediction of proteins
MIT License

Details about data split #24

Closed yangzhao1230 closed 9 months ago

yangzhao1230 commented 1 year ago

Hi,

Thanks for your great work and nice code.

I'm interested in your data splits, e.g. 'split10.csv' and 'split100.csv'. Neither the paper nor the code gives much detail on how the split data were produced. My guess is that you preprocessed them by comparing the SwissProt data against your two test sets.

I'd appreciate it if you could give more details about the data split, either in the description or in code.

canallee commented 9 months ago

For how to use MMseqs2 to cluster at different identity thresholds, please refer to Supplementary Text 1 (ML model development and evaluation). The cross-validation for each split has now been updated. Let us know if there are further questions or a need to upload the exact script for splitting.
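
Until the exact script is available, here is a minimal sketch of how such a split could be derived. It is not the script used for CLEAN. It assumes that the suffix in 'split10.csv' is the sequence identity threshold in percent, that clustering is run with `mmseqs easy-cluster` and its `--min-seq-id` option, and that one representative per cluster is kept. The file names and helper functions are hypothetical.

```python
import csv
import io
from collections import defaultdict


def mmseqs_cluster_cmd(fasta, out_prefix, tmp_dir, min_seq_id):
    """Build an `mmseqs easy-cluster` command for one identity threshold.

    Assumption: the threshold is passed as a fraction (0.1 for a 10% split).
    Run it with subprocess.run(cmd, check=True) if MMseqs2 is installed.
    """
    return ["mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
            "--min-seq-id", str(min_seq_id)]


def read_clusters(tsv_handle):
    """Parse <out_prefix>_cluster.tsv: one 'representative<TAB>member' per line."""
    clusters = defaultdict(list)
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) < 2:  # skip blank or malformed lines
            continue
        rep, member = row[0], row[1]
        clusters[rep].append(member)
    return dict(clusters)


def representatives(clusters):
    """Return the sorted cluster representatives (assumed to form the split)."""
    return sorted(clusters)


# Tiny in-memory example standing in for a real *_cluster.tsv file.
example_tsv = "P1\tP1\nP1\tP2\nP3\tP3\n"
example_clusters = read_clusters(io.StringIO(example_tsv))
example_reps = representatives(example_clusters)
```

With a real run, the TSV written by MMseqs2 would be opened in place of the in-memory string, and the representative IDs would be used to filter the full SwissProt table down to the rows for the split.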