Open JavierSanchez-Utges opened 8 months ago
Thank you for bringing this problem to our attention. Usually this is an issue because OpenBabel is returning an SP atom when it infers bonds, which we wouldn't expect for amino acids. You could try minimizing the structures in your dataset or checking for missing atoms.
That is interesting. From what I can see, it happens multiple times in my structure dataset, which comprises 2K human protein structures from PDBe. Perhaps the vector representation of hybridisation could be extended from 2 elements to 3? But I guess that would then be a different model with different features.
I have noticed another error, KeyError: 'SP3D', e.g., atom 784 of PDB: 4y88, chain A. For KeyError: 'SP', an example is atom 3545 of PDB: 6en6, chain D.
Are these modified amino acids? Our policy is that we would rather have GrASP crash when we see something non-standard or low-resolution (when OB fails bond perception) so we aren't silently making predictions on features it has never seen.
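To make the "crash rather than guess" policy concrete, here is a minimal sketch of a fail-fast hybridisation encoder. It is not GrASP's actual code: the `HYBRIDIZATION_INDEX` table, the `UnsupportedHybridizationError` class, and the `encode_hybridization` function are hypothetical names, and the two-entry table is assumed only from the statement in this thread that SP2 and SP3 are the states covered.

```python
# Hypothetical lookup table: this thread only says SP2 and SP3 are covered.
HYBRIDIZATION_INDEX = {"SP2": 0, "SP3": 1}


class UnsupportedHybridizationError(KeyError):
    """Raised when an atom has a hybridisation state the features do not cover."""


def encode_hybridization(state, atom_id, structure_id):
    """Return a one-hot list for `state`, or raise with enough context to debug."""
    if state not in HYBRIDIZATION_INDEX:
        # Fail loudly instead of silently featurizing an unseen state.
        raise UnsupportedHybridizationError(
            f"{structure_id}: atom {atom_id} has hybridisation {state!r}; "
            f"expected one of {sorted(HYBRIDIZATION_INDEX)}"
        )
    one_hot = [0] * len(HYBRIDIZATION_INDEX)
    one_hot[HYBRIDIZATION_INDEX[state]] = 1
    return one_hot
```

Because the error subclasses `KeyError`, existing `except KeyError` handlers would still catch it, but the message now names the structure and atom to inspect.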
Both of these examples, and a few other atoms that crash due to unaccounted hybridisation states, are all altLoc atoms. It might be that, because of the multiple alternative locations and the proximity of the atoms, wrong bonds are being inferred? The structures are really good quality.
Perhaps a step to deal with altlocs might solve this.
Okay, that makes sense. OB is probably parsing both altLocs, which breaks bond perception. As far as I understand, MDAnalysis doesn't have a standard way to handle these for us to pre-process them. If there aren't too many, I would fix the input structures by hand; if you find a robust way to handle them, I can look into adding it.
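One possible pre-processing step, sketched here under stated assumptions, is to keep a single alternate location per residue before bond perception. This works on raw PDB-format lines and relies only on the fixed-column layout of ATOM/HETATM records (altLoc in column 17, occupancy in columns 55-60). The function name `strip_altlocs` and the rule of keeping the conformer with the highest summed occupancy are illustrative choices, not something GrASP or MDAnalysis provides.

```python
from collections import defaultdict


def _altloc(line):
    """Return the altLoc label of an ATOM/HETATM line, or '' if it has none."""
    if line.startswith(("ATOM  ", "HETATM")) and len(line) > 16 and line[16] != " ":
        return line[16]
    return ""


def _residue_key(line):
    # Chain ID plus residue number and insertion code (columns 22-27).
    return (line[21], line[22:27])


def strip_altlocs(pdb_lines):
    """Keep one alternate location per residue and blank the altLoc column."""
    # Sum occupancy per (residue, label) so a residue keeps one whole conformer.
    occupancy = defaultdict(lambda: defaultdict(float))
    for line in pdb_lines:
        label = _altloc(line)
        if label:
            try:
                occ = float(line[54:60])
            except ValueError:
                occ = 0.0  # missing or malformed occupancy field
            occupancy[_residue_key(line)][label] += occ

    # Highest summed occupancy wins; ties go to the alphabetically first label.
    chosen = {
        residue: min(labels, key=lambda k: (-labels[k], k))
        for residue, labels in occupancy.items()
    }

    kept = []
    for line in pdb_lines:
        label = _altloc(line)
        if label:
            if label != chosen[_residue_key(line)]:
                continue  # drop atoms from the conformers that were not chosen
            line = line[:16] + " " + line[17:]
        kept.append(line)
    return kept
```

Choosing per residue rather than per atom avoids mixing atoms from different conformers in the same residue, which could itself produce odd geometry.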
I will add a check/warning that detects altLocs in the parsing code to save time debugging in the future.
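A check of this kind could look roughly like the sketch below. It scans raw PDB-format lines rather than hooking into GrASP's actual parsing code, and the names `find_altlocs` and `warn_on_altlocs` are hypothetical. The only format fact it relies on is that the altLoc indicator sits in column 17 of ATOM/HETATM records.

```python
import warnings


def find_altlocs(pdb_lines):
    """Return the sorted altLoc labels found on ATOM/HETATM records."""
    labels = set()
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM  ", "HETATM"))
        if is_atom and len(line) > 16 and line[16] != " ":
            labels.add(line[16])
    return sorted(labels)


def warn_on_altlocs(pdb_lines, structure_id):
    """Emit a warning naming the structure if any altLocs are present."""
    labels = find_altlocs(pdb_lines)
    if labels:
        warnings.warn(
            f"{structure_id}: alternate locations {labels} present; "
            "bond perception may fail on these atoms"
        )
    return labels
```

Returning the labels as well as warning makes it easy to keep a record of which systems were affected.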
So, there is this script: https://github.com/harryjubb/pdbtools/blob/master/clean_pdb.py from Harry Jubb's group. It was written to pre-process structures before running an older version of Arpeggio (https://github.com/harryjubb/arpeggio). It takes PDB format as input and deals with altLocs, chain breaks, etc. I will try running it and then run parse_files.py to see if that helps with these issues.
I recommend printing something when there are altLocs so you have a record of those systems. @bodhivani said she looked at both when working with them in case reorganization changed how accessible the site was and/or changed the predictions.
Many of the proteins in my dataset are crashing in featurize_protein.py with a KeyError. The hybridisation state of the atom is SP, but only SP2 and SP3 are accounted for in the dictionary. How could this be fixed? Thanks!