ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License
14 stars 7 forks source link

entity.type of single-amino acid entities #70

Closed bienchen closed 2 years ago

bienchen commented 2 years ago

I do compare mmCIF files generated with python-ihm against PDB entries, as a test. Now I stumbled up on the entity.type of a single glycine as a compound in PDB entry 3LMS. In the PDB mmCIF file, the glycine is marked as non-polymer entity. Therefore it has a pdbx_entity_nonpoly entry and no entity_poly entry. In the chem_comp category, it's marked as 'peptide linking'. When I run python-ihm to write a mmCIF file with a single glycine, its marked as polymer, gets an entry in entity_poly and no pdbx_entity_nonpoly. I would expect python-ihm to handle entities the same way as the PDB does.

Here is an example to reproduce the behaviour:

import tempfile

import ihm
import ihm.dumper

system = ihm.System()

gly = ihm.PeptideChemComp("GLY", "G", "G", name="GLYCINE", formula="C2 H5 N O2")

gly_entity = ihm.Entity(("G",), description="GLYCINE")

system.entities.extend((gly_entity,))

with tempfile.TemporaryFile(mode="w+", encoding="utf8") as fp:
    ihm.dumper.write(fp, [system])
    fp.seek(0)
    for line in fp:
        if line.startswith(("_entity.", "_entity_poly.")):
            print(line.strip())
        elif line.startswith("1 polymer"):
            print(line.strip())
            print("     ↑")
            print("should be 'non-polymer'\n")
        elif line.startswith("1 1 GLY ."):
            print(line.strip())
            print("                    ↑")
            print(
                "There shouldn't be a 'entity_poly' entry but "
                + "'pdbx_entity_nonpoly'\n"
            )

print("\nExpected outcome:\n")

class MyEntity(ihm.Entity):
    def is_polymeric(self):
        return len(self.sequence) != 1

gly_entity = MyEntity(("G",), description="GLYCINE")

system.entities = [gly_entity]

with tempfile.TemporaryFile(mode="w+", encoding="utf8") as fp:
    ihm.dumper.write(fp, [system])
    fp.seek(0)
    for line in fp:
        if line.startswith(
            ("_chem_comp.", "_entity.", "_pdbx_entity_nonpoly.")
        ):
            print(line.strip())
        elif line.startswith("GLY"):
            print(line.strip())
            print("            ↑")
            print("Glycine still marked as 'peptide linking'\n")
        elif line.startswith("1 non-polymer"):
            print(line.strip())
            print("       ↑")
            print("Glycine marked 'non-polymer' as an entity\n")
        elif line.startswith("1 GLYCINE GLY"):
            print(line.strip())
            print("              ↑")
            print("'pdbx_entity_nonpoly' entry for glycine")

The second half of the example writes the mmCIF file as I would expect it by overriding ihm.Entity.is_polymer(), but I'm not sure if this does not break the scheme at a different place.

benmwebb commented 2 years ago

I think your fix for is_polymer() should be fine. I'll commit a fix. A single-residue entity is likely so rare in regular usage that I doubt anything will break[*], but we should do what PDB does.

[*] Lots of ihm (and probably ma) categories only work with polymeric entities. For example a cross-link cannot be made to a non-polymer since it uses seq_id which is only defined for polymers. But a cross-link (or a model of any kind) involving single-residue entities probably makes no sense anyway.

bienchen commented 2 years ago

I tested the fix and it works as expected (I prepared unit tests for all cases with my lazy fix). Thanks a lot for the fast response!