jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License

Problematic structures? #58

Closed: KirillShmilovich closed this issue 8 months ago

KirillShmilovich commented 1 year ago

Thanks for making this resource available to the public!

I've been exploring some structures in sidechainnet and noticed some unphysical and problematic renderings occurring, for example:

import sidechainnet as scn

thinning = 30
d = scn.load(casp_version=12, thinning=thinning)
example = 169
set_name = 'train'
seq, ang, crd, mask, sec = d[set_name]['seq'][example], d[set_name]['ang'][example], d[set_name]['crd'][example], d[set_name]['msk'][example], d[set_name]['sec'][example]
res = d[set_name]['res'][example]
name = d[set_name]['ids'][example]
print(f"\nExample using {name}.\n")
print(f"Sequence, Mask, and Secondary Structure:\n{seq}\n{mask}\n{sec}\n")
print(f"Angles:\n{ang[:3]} ...\n")
print(f"Coordinates:\n{crd[:3]} ...\n")
print(f"Resolution:\n{res} A")
# visualize
sb1 = scn.StructureBuilder(seq, crd=crd)
sb1.to_3Dmol()

This produces the following output: [image]

Notice the collapsed terminal residue and the unphysical bonds in a number of the side chains. Other problematic structures can also be seen using example=6345 or example=1000 in the code above.

Do you have any ideas about why these structures are being improperly resolved with sidechainnet, or any fixes?
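One way to screen for entries like these without rendering each one is to check the distance between consecutive Cα atoms, which is about 3.8 Å for a normal trans peptide bond (roughly 2.9 to 3.0 Å for a cis one). The following is a minimal sketch, not part of sidechainnet: `flag_ca_breaks`, its thresholds, and the input layout (one Cα coordinate per residue, `None` where unresolved) are all assumptions made for illustration.

```python
import math


def flag_ca_breaks(ca_coords, min_dist=2.8, max_dist=4.1):
    """Return indices i where the CA(i)-CA(i+1) distance looks unphysical.

    ca_coords: list of (x, y, z) tuples in angstroms, or None for residues
    with no resolved CA atom (hypothetical layout, not sidechainnet's own).
    Pairs that involve a missing atom are skipped rather than flagged.
    """
    flagged = []
    for i in range(len(ca_coords) - 1):
        a, b = ca_coords[i], ca_coords[i + 1]
        if a is None or b is None:
            continue
        d = math.dist(a, b)
        # Thresholds are illustrative assumptions, chosen to allow both
        # trans (~3.8 A) and cis (~2.9 A) peptide geometry.
        if d < min_dist or d > max_dist:
            flagged.append(i)
    return flagged


# Toy chain: two ideal 3.8 A steps, then a collapsed residue, then a gap.
toy_ca = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.7, 0.0, 0.0),   # sits almost on top of the previous CA
    None,              # unresolved residue
    (20.0, 0.0, 0.0),
]
print(flag_ca_breaks(toy_ca))
```

To apply this to a sidechainnet entry, the Cα coordinates would first have to be pulled out of `crd` according to the library's per-residue atom layout, and padded or missing atoms mapped to `None`; that step is left out here because it depends on the library version.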

jonathanking commented 1 year ago

First, thanks for your interest and for being willing to share any problems you’ve encountered. This definitely looks wrong, so I’m glad you are bringing it to my attention!

Second, regarding the issue itself: this can happen when a structure includes multiple experimental residues at the location of a single biological residue position. It occurs when the alternative residue position is not annotated in a way that lets downstream software determine that the overlapping residues actually refer to a single residue. For instance, I’ve noticed (and tried my best to exclude) similar issues like #38.
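To illustrate the failure mode described above, here is a minimal sketch that looks for residue positions carrying more than one residue's worth of atoms. It assumes the atoms have already been parsed into `(chain_id, res_seq, insertion_code, res_name, atom_name)` tuples; that record layout and the function name `find_overlapping_positions` are hypothetical and are not how sidechainnet represents structures internally.

```python
from collections import defaultdict


def find_overlapping_positions(atoms):
    """Return sorted residue positions that appear to hold overlapping residues.

    atoms: iterable of (chain_id, res_seq, insertion_code, res_name, atom_name)
    tuples (a hypothetical, already-parsed representation).
    A position is reported if it has more than one residue name, or if the
    same atom name occurs more than once at that position.
    """
    names_at = defaultdict(set)
    atom_counts = defaultdict(int)
    for chain, res_seq, icode, res_name, atom_name in atoms:
        pos = (chain, res_seq, icode)
        names_at[pos].add(res_name)
        atom_counts[(pos, atom_name)] += 1

    overlapping = {pos for pos, names in names_at.items() if len(names) > 1}
    overlapping |= {pos for (pos, _), n in atom_counts.items() if n > 1}
    return sorted(overlapping)


# Toy records: position 11 has two residue types, position 12 repeats an atom.
toy_atoms = [
    ("A", 10, "", "SER", "N"),
    ("A", 10, "", "SER", "CA"),
    ("A", 11, "", "LEU", "CA"),
    ("A", 11, "", "MET", "CA"),
    ("A", 12, "", "GLY", "CA"),
    ("A", 12, "", "GLY", "CA"),
]
print(find_overlapping_positions(toy_atoms))
```

A check like this only catches overlaps that are visible in the residue numbering; it says nothing about whether a given entry in the dataset was affected.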

I’ll need some time to look at them more closely and submit a patch, so thanks for your patience in the meantime. If you need the dataset for training, I’d recommend skipping over these structures by their ID. Also, I would recommend loading the data as a SCNDataset object via scn.load(…scn_dataset=True). You can see the README for more information on that if it’s not clear.

Cheers!
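Skipping entries by ID, as suggested above, could look roughly like the sketch below for the dictionary-style data used in the original example. It assumes each split is a dict of parallel lists (`'ids'`, `'seq'`, `'crd'`, and so on) of equal length; `drop_ids` is a hypothetical helper, and the IDs shown are placeholders rather than real sidechainnet IDs.

```python
def drop_ids(split, bad_ids):
    """Return a copy of one data split with the given IDs removed.

    split: dict mapping field names ('ids', 'seq', ...) to parallel lists,
    assumed to all have the same length as split['ids'].
    bad_ids: iterable of ID strings to exclude.
    """
    bad = set(bad_ids)
    keep = [i for i, pid in enumerate(split["ids"]) if pid not in bad]
    # Apply the same index selection to every field so they stay aligned.
    return {key: [values[i] for i in keep] for key, values in split.items()}


# Toy split with placeholder IDs; real IDs would come from d[set_name]['ids'].
toy_split = {
    "ids": ["entry_a", "entry_b", "entry_c"],
    "seq": ["AAA", "CCC", "GGG"],
    "res": [1.5, 2.0, 1.8],
}
cleaned = drop_ids(toy_split, ["entry_b"])
print(cleaned["ids"])
```

With the data from the original example, this would be called as `d['train'] = drop_ids(d['train'], bad_ids)`, where `bad_ids` holds the names printed for the problematic examples. The original split is left unmodified, so the filtered and unfiltered versions can be compared.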

jonathanking commented 8 months ago

I looked into this a little more, and that's actually the geometry of the structure (see https://www.rcsb.org/structure/2i2j, model 1). The same one is used in ProteinNet. Thanks for letting me know though! I'll be excluding these from the next release since they're not appropriate for training models.