tttianhao / CLEAN

CLEAN: a contrastive learning model for high-quality functional prediction of proteins
MIT License

Dataset contains duplicate structures #14

Closed: zas97 closed this issue 1 year ago

zas97 commented 1 year ago

The split100.csv file contains duplicate structures that have different uniprot_id values but the same ec_number. Is that expected?

tttianhao commented 1 year ago

Do you mean same amino acid sequence?

zas97 commented 1 year ago

yes

tttianhao commented 1 year ago

This is not expected. All of our data were downloaded from UniProt, whose database may itself contain duplicate sequences.
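
For anyone who wants to check this locally, here is a minimal sketch of how duplicate sequences could be listed. It is not part of the CLEAN codebase. The column names `uniprot_id` and `ec_number` are taken from the question above, while the `sequence` column name, the delimiter, and both helper functions are assumptions for illustration and may need adjusting to match the actual header of split100.csv.

```python
import csv
from collections import defaultdict


def load_rows(path, delimiter=","):
    """Read a delimited file into a list of dicts keyed by the header row.

    The delimiter is an assumption; pass "\t" if the file is tab-separated.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def find_duplicate_sequences(rows, seq_col="sequence",
                             id_col="uniprot_id", ec_col="ec_number"):
    """Group rows by amino acid sequence and keep sequences seen more than once.

    Column names are hypothetical defaults. Returns a dict mapping each
    repeated sequence to a list of (id, ec_number) pairs in file order.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[seq_col]].append((row[id_col], row[ec_col]))
    return {seq: entries for seq, entries in groups.items() if len(entries) > 1}


if __name__ == "__main__":
    # Small made-up rows standing in for the real file contents.
    demo_rows = [
        {"uniprot_id": "ID_A", "ec_number": "1.1.1.1", "sequence": "MKV"},
        {"uniprot_id": "ID_B", "ec_number": "1.1.1.1", "sequence": "MKV"},
        {"uniprot_id": "ID_C", "ec_number": "2.7.1.1", "sequence": "MAA"},
    ]
    for seq, entries in find_duplicate_sequences(demo_rows).items():
        print(seq, entries)
```

To run it against the real file, replace the demo rows with `load_rows("split100.csv")`, adjusting the path and delimiter as needed. Grouping on the exact sequence string only catches identical sequences; near-identical sequences would need a clustering tool instead.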