tiwarylab / GrASP

Graph Attention Site Prediction (GrASP): Identifying Druggable Binding Sites Using Graph Neural Networks with Attention
MIT License

Unaccounted Hybridisation states #8

Open JavierSanchez-Utges opened 8 months ago

JavierSanchez-Utges commented 8 months ago

Many of the proteins in my dataset crash in featurize_protein.py with a KeyError. The hybridisation state of the atom is SP, but the dictionary only accounts for SP2 and SP3. How could this be fixed? Thanks!

Michael-C-Strobel commented 8 months ago

Thank you for bringing this problem to our attention. Usually this is an issue because OpenBabel is returning an SP atom when it infers bonds, which we wouldn't expect for amino acids. You could try minimizing the structures in your dataset or checking for missing atoms.

JavierSanchez-Utges commented 8 months ago

That is interesting. From what I can see, it happens multiple times in my structure dataset, which comprises 2K human protein structures from PDBe. Perhaps the vector representation of hybridisation could be extended to a 3-element vector instead of 2? But I guess it would then be a different model, with different features.

I have noticed another KeyError: 'SP3D', e.g., atom 784 of PDB: 4y88, chain A. For KeyError: 'SP', atom 3545 of PDB: 6en6, chain D.

zwsmith200 commented 8 months ago

Are these modified amino acids? Our policy is that we would rather have GrASP crash when we see something non-standard or low-resolution (when OB fails bond perception) so we aren't silently making predictions on features it has never seen.
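As a rough illustration of that fail-loudly policy, here is a minimal sketch. The table contents, the 2-element encoding, and the function name `encode_hybridization` are hypothetical and are not taken from GrASP's actual code; the point is only that an unseen state raises a descriptive error instead of being silently encoded.

```python
# Hypothetical encoding table; the real dictionary in featurize_protein.py may differ.
HYBRIDIZATION_ENCODING = {
    "SP2": [1, 0],
    "SP3": [0, 1],
}


def encode_hybridization(state, atom_label):
    """Return the encoding for a hybridisation state, or fail with a clear message."""
    try:
        return HYBRIDIZATION_ENCODING[state]
    except KeyError:
        # Refuse to featurize states the model has never seen during training.
        raise ValueError(
            f"Unsupported hybridisation state {state!r} for {atom_label}; "
            "check the structure for altLocs, missing atoms, or modified residues."
        ) from None
```

Compared with a bare KeyError, the message points directly at the offending atom and the likely causes.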

JavierSanchez-Utges commented 8 months ago

Both of these examples, and a few other atoms that crash due to unaccounted hybridisation states, are altLoc atoms. It might be that, because of the multiple alternative locations and the proximity of the atoms, wrong bonds are being inferred. The structures are of really good quality.

Perhaps a step to deal with altlocs might solve this.

zwsmith200 commented 8 months ago

Okay, that makes sense. OB is probably parsing both altLocs, which breaks bond perception. As far as I understand, MDAnalysis doesn't have a standard way to handle these for us to pre-process them. If there aren't too many, I would fix the input structures by hand; if you find a robust way to handle them, I can look into adding it.
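One possible pre-processing step, sketched here under assumptions: for each residue, keep only the first altLoc identifier encountered and blank the altLoc column, working directly on fixed-width PDB records (altLoc in column 17, chain in column 22, residue number in columns 23-26, insertion code in column 27). The function name `keep_first_altloc` is hypothetical and this is not part of GrASP. It ignores occupancy, so it may not pick the best-supported conformer.

```python
def keep_first_altloc(pdb_lines):
    """Keep one alternate location per residue and blank the altLoc column."""
    chosen = {}  # (chain, resSeq, iCode) -> altLoc identifier kept for that residue
    kept = []
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM", "ANISOU"))
        if is_atom and len(line) > 26:
            altloc = line[16]  # column 17 in the PDB format
            if altloc != " ":
                residue = (line[21], line[22:26], line[26])
                # The first altLoc seen for a residue wins; others are dropped.
                if chosen.setdefault(residue, altloc) != altloc:
                    continue
                line = line[:16] + " " + line[17:]
        kept.append(line)
    return kept
```

Usage would be along the lines of reading a file with `open(path).readlines()`, passing the lines through the function, and writing the result to a new file before running the parsing step.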

zwsmith200 commented 8 months ago

I will add a check/warning that detects altLocs in the parsing code to save time debugging in the future.
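A minimal sketch of what such a check could look like, assuming plain PDB-format input read as text lines. The function names `find_altloc_residues` and `warn_on_altlocs` are hypothetical, and this is not the check that was actually added to GrASP.

```python
import warnings


def find_altloc_residues(pdb_lines):
    """Return sorted (chain, resSeq, resName) tuples for residues with altLoc atoms."""
    found = set()
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 26:
            if line[16] != " ":  # column 17 holds the altLoc identifier
                found.add((line[21], line[22:26].strip(), line[17:20]))
    return sorted(found)


def warn_on_altlocs(pdb_lines, name="structure"):
    """Emit a warning listing altLoc residues and return them for record keeping."""
    residues = find_altloc_residues(pdb_lines)
    if residues:
        warnings.warn(
            f"{name}: {len(residues)} residue(s) with altLocs: {residues}"
        )
    return residues
```

Returning the list as well as warning makes it easy to keep a per-structure record of which systems contained altLocs.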

JavierSanchez-Utges commented 8 months ago

So, there is this script: https://github.com/harryjubb/pdbtools/blob/master/clean_pdb.py from Harry Jubb's group. It was for pre-processing structures before running an older version of Arpeggio (https://github.com/harryjubb/arpeggio). It takes PDB format as input and deals with altLocs, chain breaks, etc. I will try running it before parse_files.py to see if that resolves these issues.

zwsmith200 commented 8 months ago

I recommend printing something when there are altLocs so you have a record of those systems. @bodhivani said she looked at both when working with them in case reorganization changed how accessible the site was and/or changed the predictions.