nglviewer / ngl

WebGL protein viewer
http://nglviewer.org/ngl/
MIT License
NGL Viewer displays incorrect residue numbers for PDB Hybrid36 format #810

Open ChrisMoth opened 3 years ago

ChrisMoth commented 3 years ago

.... and that may be fine.... If you decide this is of no interest, that is very helpful too!

Details:

Structural models of long transcripts (like the human gene TTN, with over 30,000 residues) need residue numbers > 9999.

Obviously, the mmCIF format is the RCSB/PDBe's solution.

The UCSF Sali Lab Modbase group has adopted hybrid-36 format for these models. I left one of their models here:

https://structbio.vanderbilt.edu/~mothcw/modbase_pdb36_example/

NGL Viewer loses the residue numbers from these structures. But, it does seem to retain the atom coordinates (impressive).

If you are inspired to parse hybrid36, here are more details:

http://cci.lbl.gov/hybrid_36/

Chimera and ChimeraX auto-detect and load hybrid-36 files. They both save the files, but I worry that Chimera uses 5-column residue numbering and not hybrid36 numbering at that juncture.

Thanks for thinking about this. If you firmly decide not to implement hybrid36, that is also helpful. In that case, I will continue to ask the Sali Lab to move to mmCIF as the default format, as the RCSB has done. Your decision will also help my bug report to the UCSF ChimeraX team, which is currently not writing secondary structure annotations to its mmCIF outputs.

fredludlow commented 3 years ago

Hi @ChrisMoth,

It looks like a small-ish change to the parser (if it's not an int, try decoding it from hybrid36 format). I'd be happy to review/merge a PR that does this. It would need a JS decoder function, which would get called about here when required: https://github.com/nglviewer/ngl/blob/master/src/parser/pdb-parser.ts#L291

Currently there is support for hex numbering (enabled via a specific flag for the PDB Parser) so there's also a question of what the default behaviour should be, when you need a flag etc.

ChrisMoth commented 3 years ago

I'd hate to encumber nglviewer for a small set of structures.

Can you tell me more about the simple hex encoding you already support? If I could write my structures with hex atom and residue IDs from biopython (or a quickly hacked biopython), I'm sure that would be great, since four hex digits take us to 65,535 residues, and the longest I know of is TTN at 35,000 or so amino acid positions. Do I use a different file extension from ".pdb"? How do I set the PDB Parser to be hex aware from the stage.loadFile() call?

I wonder if "parseInt" is something that should be settable from the stage.loadFile call. I'd be happy with requiring two script references, something like:

stage.loadFile("4cox.cif", {defaultRepresentation: true, ParseInt: hybrid36ParseInt});

I am not super-fond of "try hybrid36 if a simple int fails to parse", because I would expect int parsing to fail more often due to fundamental file corruption. The users of NGL should know that their PDB file has residues > 9999 and also that they are using hybrid36 (or, I suppose, straight hex - where can I learn more about that???)

Sorry to be a bit long winded. In summary:

1) HEX would solve all my problems without changing nglviewer. Tell me more about standards, and how to tell NGL "my file uses hex".

2) If you are still interested in a pull request for hybrid36, please send along a little more guidance about how you think that would look for an NGL coder.... and I'll give it a whirl.

Thanks(!)

garyo commented 3 years ago

The sample encoders/decoders on the page you linked, @ChrisMoth, look like they parse ints just fine as well as the extended formats, so they could be drop-in replacements for parsing those ATOM and residue fields. (For speed, I'd give them special cases for ints, i.e. when digit[0] is in [0-9] or when atom <= 99999 or residue <= 9999.) That seems like a good simple extension to me.
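For illustration, a decoder along these lines might look like the sketch below. It follows the hybrid-36 layout described on the linked page (decimal first, then uppercase base-36, then lowercase base-36) and takes the plain-integer fast path first. The function name `decodeHybrid36` and its signature are hypothetical and not part of NGL. As a design choice of this sketch, blank or malformed fields throw rather than falling back silently.

```typescript
// Hypothetical helper, not part of NGL: decode one fixed-width PDB field
// (width 4 for residue numbers, 5 for atom serials) using hybrid-36.
function decodeHybrid36(width: number, field: string): number {
  const s = field.trim();

  // Fast path: plain decimal fields, the overwhelmingly common case.
  if (/^-?\d+$/.test(s)) {
    return parseInt(s, 10);
  }

  // Extended values must fill the whole field, start with a letter,
  // and use a single letter case throughout.
  const isUpper = /^[A-Z][0-9A-Z]*$/.test(s);
  const isLower = /^[a-z][0-9a-z]*$/.test(s);
  if (s.length !== width || !(isUpper || isLower)) {
    throw new Error(`invalid hybrid-36 field: "${field}"`);
  }

  const base36 = parseInt(s, 36);         // parseInt ignores letter case
  const decimalSpan = Math.pow(10, width); // first value past the decimal range
  const unit = Math.pow(36, width - 1);    // place value of the leading digit

  return isUpper
    ? base36 - 10 * unit + decimalSpan  // "A000" -> 10000
    : base36 + 16 * unit + decimalSpan; // "a000" -> 1223056
}
```

With width 4, the uppercase range runs from `A000` (10000) to `ZZZZ` (1223055), which comfortably covers a TTN-sized chain.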

fredludlow commented 3 years ago

I'm not aware of an official spec for the hex format. Here's the original issue:

https://github.com/nglviewer/ngl/issues/241

And the commit that introduces it https://github.com/nglviewer/ngl/commit/e12d64bfa72afd9db4ee69937330a1eab915d419

It looks like it'll only work where you first have seen resno 9999, I'm guessing for the same reasons the hybrid36 format doesn't start to use letters till you get to 9999 decimal (backwards compatibility for resno <= 9999).
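The "switch after 9999" behaviour can be sketched roughly as follows. This is not NGL's actual code: `makeStickyHexReader` is a hypothetical name, and the sketch assumes that a hex-numbered file continues after 9999 with the hexadecimal form of the next number (2710 for 10000). Fields are read as decimal until the rollover value is seen, and as hexadecimal from then on.

```typescript
// Hypothetical sketch, not NGL's implementation: a stateful field reader
// that parses decimal until it sees the rollover value, then hexadecimal.
function makeStickyHexReader(rollover: number): (field: string) => number {
  let radix = 10;

  return (field: string): number => {
    const s = field.trim();
    // Validate strictly: parseInt alone would accept a valid prefix
    // of a corrupted field (e.g. "27G0" would parse as "27").
    const pattern = radix === 10 ? /^-?\d+$/ : /^[0-9a-fA-F]+$/;
    if (!pattern.test(s)) {
      throw new Error(`invalid base-${radix} field: "${field}"`);
    }
    const value = parseInt(s, radix);

    // Assumption: everything after the last decimal value is written in hex.
    if (radix === 10 && value === rollover) {
      radix = 16;
    }
    return value;
  };
}

// Usage: one reader per field, e.g. 9999 for resno and 99999 for atom serials.
const readResno = makeStickyHexReader(9999);
readResno("9998"); // 9998
readResno("9999"); // 9999, switches to hex
readResno("2710"); // 10000
```

Because the reader depends on having seen 9999 first, a file whose numbering jumps past 9999 without containing that exact value would still be read as decimal under this sketch.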

If you want to add hybrid36 support (which I'd be very happy to have) then I'd probably follow a similar convention, e.g.

stage.loadFile('/url/of/file.pdb', { hybrid36: true })

and then use a port of the decoder function. (It might be neatest to make a decoder that can do both hex and hy36 using a similar convention of interpreting <=9999 as decimal?)