nglviewer / ngl

WebGL protein viewer
http://nglviewer.org/ngl/
MIT License
667 stars, 170 forks

Missing residues in chain - pdb 5UJB #534

Closed: martingraham closed this issue 6 years ago

martingraham commented 6 years ago

Hi Alex, when I load PDB ID 5UJB into NGL, I get 2 chains, one of length 568 (A) and one of length 565 (B).

However, the FASTA file for the PDB entry swears blind that both chains have 604 residues in them.

Looking at the PDB file in a text editor, I'm suspicious that it starts numbering its residues at a negative index (-23). Is this legal for PDBs, and if it is, could it be causing a problem for NGL?

https://www.rcsb.org/structure/5UJB
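On the numbering question: negative residue numbers do turn up in PDB files, so anything reading the fixed-width residue number field has to accept a leading minus sign. As a rough way to check a file outside NGL, here is a minimal Python sketch that counts the residues with coordinates per chain. It assumes the standard fixed-column ATOM layout (chain ID in column 22, residue number in columns 23-26, insertion code in column 27); `observed_residues` and `make_atom_line` are names made up for this illustration, and the demo records are synthetic, not taken from 5UJB.

```python
from collections import defaultdict


def observed_residues(pdb_lines):
    """Map each chain ID to the sorted residues that have ATOM records."""
    seen = defaultdict(set)
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # only look at the first model of a multi-model file
        if not line.startswith("ATOM  "):
            continue  # note: HETATM residues are skipped in this sketch
        chain_id = line[21:22]
        res_seq = int(line[22:26])    # columns 23-26; int() accepts " -23"
        i_code = line[26:27].strip()  # column 27, insertion code (may be blank)
        seen[chain_id].add((res_seq, i_code))
    return {chain: sorted(residues) for chain, residues in seen.items()}


def make_atom_line(serial, chain_id, res_seq):
    """Hypothetical helper: build a minimal fixed-width ATOM record for the demo."""
    return f"ATOM  {serial:5d}  CA  ALA {chain_id}{res_seq:4d}"


# Synthetic demo records: chain A starts at a negative residue number.
demo = [make_atom_line(1, "A", -23), make_atom_line(2, "A", -22),
        make_atom_line(3, "B", 1)]
for chain, residues in observed_residues(demo).items():
    print(chain, len(residues), "residues, first number", residues[0][0])
```

Residues are keyed on (number, insertion code), so several atoms of the same residue count once and the sign of the number makes no difference to the count.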

martingraham commented 6 years ago

Ah, looks like chains A and B really are of length 565 and 568 in 5UJB, so I reckon it's not NGL's fault.

The trouble is I'm asking one of RCSB's web services for the best match against an HSA sequence I have, and it returns 5UJB as the top match. 1AO6 actually has slightly longer chains, but it falls into 2nd place because the sequence it has (SEQRES) is slightly shorter, even though the chains in 5UJB cover less of the sequence. I guess I'll go and figure it out from here and complain to RCSB if I can't get round it :-)

I should read up on this
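The gap described here is between the sequence declared in the SEQRES records and the residues that actually have coordinates. A small sketch of how that could be measured per chain, assuming the standard fixed-column layout (SEQRES: chain ID in column 12, residue count in columns 14-17; ATOM: chain ID in column 22, residue number in columns 23-26). `chain_coverage` is a hypothetical name, and the demo lines are synthetic.

```python
def chain_coverage(pdb_lines):
    """Return {chain: (observed, declared, fraction)} for one PDB file."""
    declared = {}  # chain -> residue count from the SEQRES numRes field
    observed = {}  # chain -> set of residues that have ATOM records
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            declared[line[11:12]] = int(line[13:17])
        elif line.startswith("ATOM  "):
            key = (int(line[22:26]), line[26:27].strip())
            observed.setdefault(line[21:22], set()).add(key)
        elif line.startswith("ENDMDL"):
            break  # first model only
    report = {}
    for chain, n_declared in declared.items():
        n_observed = len(observed.get(chain, ()))
        report[chain] = (n_observed, n_declared, n_observed / n_declared)
    return report


# Synthetic demo: chain A declares 5 residues but only 3 have coordinates.
demo = [f"SEQRES {1:3d} A {5:4d}  ALA ALA ALA ALA ALA"]
demo += [f"ATOM  {i:5d}  CA  ALA A{num:4d}" for i, num in enumerate([-1, 1, 2], 1)]
print(chain_coverage(demo))
```

With the numbers quoted in this thread, a chain with 568 observed residues out of 604 SEQRES residues would come out at roughly 0.94. Residues stored as HETATM records are not counted in this sketch, so treat the result as a lower bound.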

arose commented 6 years ago

Which service at rcsb did you use? Feel free to send a message on https://www.rcsb.org/pages/contactus. Mapping of sequences to PDB entries is often not straightforward. There is the SIFTS project (http://www.ebi.ac.uk/pdbe/docs/sifts/overview.html) which provides up-to-date mappings between sequences of different resources.

martingraham commented 6 years ago

It's the BlastPDB service

https://www.rcsb.org/pdb/rest/getBlastPDB1?sequence=DAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEPERNECFLQHKDDNPNLPRLVRPEVDVMCTAFHDNEETFLKKYLYEIARRHPYFYAPELLFFAKRYKAAFTECCQAADKAACLLPKLDELRDEGKASSAKQRLKCASLQKFGERAFKAWAVARLSQRFPKAEFAEVSKLVTDLTKVHTECCHGDLLECADDRADLAKYICENQDSISSKLKECCEKPLLEKSHCIAEVENDEMPADLPSLAADFVESKDVCKNYAEAKDVFLGMFLYEYARRHPDYSVVLLLRLAKTYETTLEKCCAAADPHECYAKVFDEFKPLVEEPQNLIKQNCELFEQLGEYKFQNALLVRYTKKVPQVSTPTLVEVSRNLGKVGSKCCKHPEAKRMPCAEDYLSVVLNQLCVLHEKTPVSDRVTKCCTESLVNRRPCFSALEVDETYVPKEFNAETFTFHADICTLSEKERQIKKQTALVELVKHKPKATKEQLKAVMDDFAAFVEKCCKADDKETCFAEEGKKLVAASQAALGL&eCutOff=10.0&matrix=BLOSUM62&outputFormat=XML

When I let that off, it returns a large number of hits ordered by score, of which 5UJB is top and 1AO6 is, say, towards the middle. Both these PDBs contain the input sequence exactly, but 5UJB has some extras at the start which seem to give it a slightly higher score (the difference in scores isn't massive). The trouble is that the chains in 5UJB aren't as long as the ones in 1AO6, so it's actually a slightly worse PDB to use for my purposes, but there's no way to tell this from the returned data. Like you say, I'll fire this in as a question to RCSB.
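Since the returned data does not say how much of each chain actually has coordinates, one possible workaround is to collect that number separately for each hit (for example by counting ATOM residues in the downloaded file) and re-rank locally. A minimal sketch of that idea follows; the `Hit` fields and `rerank` are made up for illustration and are not part of any RCSB response format, and the IDs, scores, and counts are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_id: str             # entry identifier reported by the search
    score: float            # alignment score reported by the search
    observed_residues: int  # residues with coordinates, gathered separately


def rerank(hits):
    """Prefer hits with more observed residues; break ties on search score."""
    return sorted(hits, key=lambda h: (h.observed_residues, h.score), reverse=True)


# Placeholder values only: HIT1 scores higher but has the shorter modelled chain.
hits = [Hit("HIT1", 1200.0, 565), Hit("HIT2", 1190.0, 578)]
for hit in rerank(hits):
    print(hit.pdb_id, hit.observed_residues, hit.score)
```

Sorting on the observed count first puts the entry with the longer modelled chain on top even when its alignment score is marginally lower, which is the ordering wanted here.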

arose commented 6 years ago

Sounds good to ask them, thanks.