Additional training datasets

mj-thompson commented 10 months ago

Hi there.

I'm working on a project related to what you all have developed here. While I greatly appreciate the availability of the 1.08M pseudo-labelled sequences, I was wondering whether it would be possible to obtain the other datasets used in the manuscript, namely the labelled training, validation, and test sets referenced in the text. If they're unavailable, any preprocessing code that would let me replicate the exact steps taken in the manuscript would also be great.

Thank you for your help. I include the quote below for reference.

To accompany the pseudo-labelled sequences, we construct a labelled training set and a labelled validation set from protein structures in the PDB ([Burley et al., 2019]). For proper cross-validation, sequences in both the labelled training and labelled validation sets were removed if they were homologous to any sequences in the CB513 test set, evaluated by CATH ([Sillitoe et al., 2019]) Superfamily-level classification. The final labelled training and validation sets contain 10143 and 534 sequences respectively.
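
For concreteness, the filtering step described in the quote could be sketched roughly as below. This is an illustration only, not the authors' preprocessing code: the function names, the dictionary layout (sequence ID mapped to a set of CATH codes), and the toy IDs are all assumptions, and the handling of sequences with no CATH annotation (kept here) is a guess, since the manuscript excerpt does not say.

```python
from typing import Dict, Set


def superfamily(cath_code: str) -> str:
    """Truncate a CATH code to superfamily level (the first four fields, C.A.T.H)."""
    parts = cath_code.split(".")
    if len(parts) < 4:
        raise ValueError(f"CATH code too short for superfamily level: {cath_code!r}")
    return ".".join(parts[:4])


def remove_test_homologues(
    candidates: Dict[str, Set[str]],
    test: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Drop every candidate that shares a CATH superfamily with any test sequence.

    Both arguments map a sequence ID to the CATH codes of its domains.
    Assumption: a candidate with no CATH codes is kept.
    """
    # Collect every superfamily that appears anywhere in the test set.
    test_superfamilies = {
        superfamily(code) for codes in test.values() for code in codes
    }
    # Keep a candidate only if none of its domains falls in those superfamilies.
    return {
        seq_id: codes
        for seq_id, codes in candidates.items()
        if not ({superfamily(code) for code in codes} & test_superfamilies)
    }


# Toy example with made-up sequence IDs.
candidates = {
    "seqA": {"1.10.8.10"},
    "seqB": {"3.40.50.300", "2.60.40.10"},  # one domain overlaps the test set
    "seqC": {"2.60.40.10"},
}
test_set = {"test1": {"3.40.50.300"}}

kept = remove_test_homologues(candidates, test_set)
print(sorted(kept))
```

The same function would be applied to the training and validation candidates separately, each time against the CB513 test set annotations.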

limitloss commented 10 months ago

Hi, thanks for raising this. I should be able to put those sets together over the next few days. Depending on how that goes, I'll also try to put together a separate guide on building generic training set splits based on CATH, time permitting.

limitloss commented 10 months ago

I've added links for downloading the datasets to the README.

psipred / s4pred

Additional training datasets #2