Closed yitzchak closed 1 year ago
I do not understand why the SDF format would create 1 atom type for each atom? Have you looked into this?
Yes, the sdf parser makes a new atom type mapping for each atom and stores the returned atom id in the atomTypeId array.
Thanks! Right: the issue is here: https://github.com/nglviewer/ngl/blob/beb8bf93c001fd36a60a2bf6185f1aa5d503f4d3/src/parser/sdf-parser.ts#L105 The sdf parser creates a unique atom name for each atom in the structure which is reasonable when the SDF is used to store small molecules. In V2000 mol format the limitation on the number of atoms is 999. As SDF connectivity table does not have the concept of residue, storing a macromolecule such as a protein in a SDF defeats the atomTypeId mechanism that is used to store information about individual atoms in a residue.
The issue I find with this PR is that the storage for this property is doubled for every atomStore (2 bytes per atom), including for cases when the structure is huge and the optimization on space storage is important, while it is deemed necessary to handle a case where the sdf format is used to store a structure that it is probably not suited for.
Maybe it's not that big a deal (the storage increase). Otherwise, I am wondering wether the number of different atom names should not be capped in the sdf parser directly as it might be useless anyway, and introduces perf concerns (for example there is a residue property which is a hash based on the catenation of all atom names, which in your case could result in a string close to a MB!)
In this case we are storing a collection of molecules in an SDF file. With 1000 molecules there is going to be a lot unique atom type names.
If you want cut down on the number of unique names you could do
const atomname = element + (idx - modelAtomIdxStart + 1)
This would recycle names between models provided the atoms were the same element.
Yes, something like that! Or use a second counter that is reset at each new model
Yes, I think that would be the best thing to do. Would you try it (revert your first commit and submit that change instead)?
Thanks!
Done. Works fine.
Great! Any ETA on the next point/patch release?
Thank you both very much for this. Have just published 2.0.2.
It might be sensible to give @ppillot npm publishing rights too!
The atomTypeId array in AtomStore is currently an array of 16 unsigned integers. This causes problems when loading an SDF file with more than 64k atoms. The atomTypeId overflows at this point and the bonds begin to refer to the wrong atoms. For example a bond between atom 65535 and atom 65537 will turn into a bond between atom 65535 and atom 1.