[Question] bulk reading mmcif coordinates into a numpy array

project-gemmi / gemmi

macromolecular crystallography library and utilities

https://project-gemmi.github.io/

Mozilla Public License 2.0

231 stars 46 forks source link

[Question] bulk reading mmcif coordinates into a numpy array #314

Closed marinegor closed 6 months ago

marinegor commented 6 months ago

Hi everyone,

I wonder if there's a (proper) way to read the coordinates and indices of e.g. all atoms in a particular chain into a numpy array?

For now, the way I see it working is:


# read only ATOM groups of chain 

model = gemmi.read_structure('6rz6.cif')[0]
residues = [res for res in model['A'] if res.het_flag == 'A']
arr = np.array([[at.serial, *at.pos] for res in residues for at in res])

ids, xyzs = arr[:, 0].astype(int), arr[:, 1:]

It feels that manually iterating over a gemmi.Chain object is slow (e.g. took 0.3s for a small 6RZ6 file), but I couldn't find any other way in documentation to do that. Could you say if it exists?

wojdyr commented 6 months ago

I tried it on a quite old computer and it was order of magnitude faster. Perhaps 0.3s includes startup of the python interpreter?

That said, Python code and python bindings can be slow. Next month I'll switch to nanobind which promises to have ~10× lower runtime overhead. Currently, the slowest part in this example is *at.pos. If you change it to *at.pos.tolist() it'll be 2-3x faster.

Note that non-standard residues in a polymer are HETATM not ATOM, so relying on het_flag may not work. You could use model['A'].get_polymer() instead.

marinegor commented 6 months ago

I tried it on a quite old computer and it was order of magnitude faster. Perhaps 0.3s includes startup of the python interpreter?

I was already in a ipython interpreter, so probably no. The exact command was %time arr = ...

That said, Python code and python bindings can be slow. Next month I'll switch to nanobind which promises to have ~10× lower runtime overhead. Currently, the slowest part in this example is at.pos. If you change it to at.pos.tolist() it'll be 2-3x faster.

Indeed, 6x faster! Thanks a lot :)

Note that non-standard residues in a polymer are HETATM not ATOM, so relying on het_flag may not work. You could use model['A'].get_polymer() instead.

Thanks again!