jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License

Missing sets / resolution #64

Closed: thamelry closed this issue 6 months ago

thamelry commented 9 months ago

Hi,

For the CASP12 data set:

Best regards,

-Thomas

jonathanking commented 9 months ago

Hi Thomas,

Thanks for your interest!

Those validation set splits do not exist. Only the 10, 20, 30, 40, 50, 70, and 90% splits are created. I encourage you to check out the ProteinNet paper if you would like to learn more about the original design that SidechainNet builds upon.
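As a rough illustration of the point above, here is a minimal sketch that checks a requested validation threshold against the percentages listed in this reply. The helper name `validation_key` and the `valid-<pct>` key format are assumptions made for this example, not something confirmed in this thread.

```python
# Percentages taken from the reply above.
VALIDATION_THRESHOLDS = (10, 20, 30, 40, 50, 70, 90)


def validation_key(threshold: int) -> str:
    """Return a split name for a threshold, or raise if no such split exists.

    The "valid-<pct>" naming is an assumption for illustration.
    """
    if threshold not in VALIDATION_THRESHOLDS:
        available = ", ".join(str(t) for t in VALIDATION_THRESHOLDS)
        raise KeyError(
            f"No validation split at {threshold}%; available: {available}"
        )
    return f"valid-{threshold}"


print(validation_key(30))  # valid-30
```

Asking for a threshold outside that list, such as 60, raises a `KeyError` that names the available thresholds.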

The test set proteins are downloaded from the CASP contest website, not the RCSB PDB. As such, they have special CASP-specific identifiers, and the resolution is not available from the RCSB PDB (which is where SidechainNet looks for resolution data).
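If you filter by resolution on your side, the missing values for test set proteins need explicit handling. The sketch below is one hedged way to do that. It assumes a missing resolution is represented as `None`, and the record IDs and values are invented placeholders, not real SidechainNet entries.

```python
from typing import Optional


def passes_resolution(
    resolution: Optional[float],
    max_resolution: float,
    keep_unknown: bool = True,
) -> bool:
    """Decide whether an entry survives a resolution cutoff.

    Entries with no resolution (assumed to be None here) are kept or
    dropped according to keep_unknown rather than compared numerically.
    """
    if resolution is None:
        return keep_unknown
    return resolution <= max_resolution


# Hypothetical records for illustration only.
entries = [
    {"id": "example_train_entry", "resolution": 1.8},
    {"id": "example_low_quality_entry", "resolution": 3.5},
    {"id": "example_casp_target", "resolution": None},
]

kept = [e["id"] for e in entries if passes_resolution(e["resolution"], 2.5)]
print(kept)  # ['example_train_entry', 'example_casp_target']
```

Passing `keep_unknown=False` would instead drop every entry without a resolution, which would remove all CASP test targets under this assumption.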

Best,

Jonathan

jonathanking commented 6 months ago

Closing this as I go through issues, but feel free to let me know if you have any further questions.