jozhang97 / MutateEverything


Missing cDNA Data #3

Open KiAkize opened 9 months ago

KiAkize commented 9 months ago

Thank you for sharing your code and data :) However, I ran into an issue while examining the cDNA data. Several PDB entries that have corresponding MSA and FASTA files are not listed in cdna_train.csv or cdna2_test.csv.

Is it possible that the mutation records for these PDB entries were omitted during compilation? The PDB codes are:

6scw, 5ubs, 5uce, 3cqt

jozhang97 commented 9 months ago

Thanks for your interest! Because of how we created the splits, a small number of proteins in the cDNA dataset can indeed end up in neither the training set nor the test set.

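We constructed our cDNA training set by filtering out any proteins similar to those in the literature test sets. From the proteins that were filtered out, we then constructed a test set (cdna2) using only those that are dissimilar to the cDNA training set. A protein that was filtered out but is still similar to the training set therefore lands in neither split.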
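
To make the split logic concrete, here is a minimal sketch of the two-stage filtering described above. It is not the repository's actual code: the function `split_cdna` and the two similarity predicates are hypothetical, and the thread does not say how similarity is measured, so the predicates are placeholders the caller supplies. The protein IDs are made-up toy values, not real PDB codes.

```python
def split_cdna(all_proteins, similar_to_literature, similar_to_train):
    """Hypothetical sketch of the train / cdna2 / neither split.

    similar_to_literature(p) -> bool: placeholder for "p resembles a
        protein in the literature test sets".
    similar_to_train(p, train) -> bool: placeholder for "p resembles a
        protein in the cDNA training set".
    """
    all_proteins = set(all_proteins)
    # Stage 1: the training set keeps only proteins that are NOT similar
    # to the literature test sets.
    train = {p for p in all_proteins if not similar_to_literature(p)}
    filtered_out = all_proteins - train
    # Stage 2: of the filtered-out proteins, only those dissimilar to the
    # training set become the cdna2 test set.
    test = {p for p in filtered_out if not similar_to_train(p, train)}
    # Whatever remains was filtered out of training but is too close to
    # the training set to be used for testing.
    neither = filtered_out - test
    return train, test, neither


# Toy data (assumed for illustration only).
proteins = ["A", "B", "C", "D"]
literature_like = {"C", "D"}  # assumed similar to literature test sets
train_like = {"D"}            # assumed similar to the training set

train, test, neither = split_cdna(
    proteins,
    similar_to_literature=lambda p: p in literature_like,
    similar_to_train=lambda p, _train: p in train_like,
)
print(sorted(train), sorted(test), sorted(neither))
```

Under these assumptions, a protein such as the toy entry "D" has MSA and FASTA files available but appears in neither CSV, which matches the situation described in the question.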