nglviewer / ngl

WebGL protein viewer
http://nglviewer.org/ngl/
MIT License
657 stars 168 forks source link

Make atom names unique only within a model for SDF #949

Closed yitzchak closed 1 year ago

yitzchak commented 1 year ago

The atomTypeId array in AtomStore is currently an array of 16 unsigned integers. This causes problems when loading an SDF file with more than 64k atoms. The atomTypeId overflows at this point and the bonds begin to refer to the wrong atoms. For example a bond between atom 65535 and atom 65537 will turn into a bond between atom 65535 and atom 1.

ppillot commented 1 year ago

I do not understand why the SDF format would create 1 atom type for each atom? Have you looked into this?

yitzchak commented 1 year ago

Yes, the sdf parser makes a new atom type mapping for each atom and stores the returned atom id in the atomTypeId array.

https://github.com/nglviewer/ngl/blob/beb8bf93c001fd36a60a2bf6185f1aa5d503f4d3/src/parser/sdf-parser.ts#L108

ppillot commented 1 year ago

Thanks! Right: the issue is here: https://github.com/nglviewer/ngl/blob/beb8bf93c001fd36a60a2bf6185f1aa5d503f4d3/src/parser/sdf-parser.ts#L105 The sdf parser creates a unique atom name for each atom in the structure which is reasonable when the SDF is used to store small molecules. In V2000 mol format the limitation on the number of atoms is 999. As SDF connectivity table does not have the concept of residue, storing a macromolecule such as a protein in a SDF defeats the atomTypeId mechanism that is used to store information about individual atoms in a residue.

The issue I find with this PR is that the storage for this property is doubled for every atomStore (2 bytes per atom), including for cases when the structure is huge and the optimization on space storage is important, while it is deemed necessary to handle a case where the sdf format is used to store a structure that it is probably not suited for.

Maybe it's not that big a deal (the storage increase). Otherwise, I am wondering wether the number of different atom names should not be capped in the sdf parser directly as it might be useless anyway, and introduces perf concerns (for example there is a residue property which is a hash based on the catenation of all atom names, which in your case could result in a string close to a MB!)

yitzchak commented 1 year ago

In this case we are storing a collection of molecules in an SDF file. With 1000 molecules there is going to be a lot unique atom type names.

If you want cut down on the number of unique names you could do

const atomname = element + (idx - modelAtomIdxStart + 1)

This would recycle names between models provided the atoms were the same element.

ppillot commented 1 year ago

Yes, something like that! Or use a second counter that is reset at each new model Yes, I think that would be the best thing to do. Would you try it (revert your first commit and submit that change instead)? Thanks!

yitzchak commented 1 year ago

Done. Works fine.

yitzchak commented 1 year ago

Great! Any ETA on the next point/patch release?

fredludlow commented 1 year ago

Thank you both very much for this. Have just published 2.0.2.

It might be sensible to give @ppillot npm publishing rights too!