ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Make sure all unused objects are preserved by read-write cycle #144

Closed: benmwebb closed this issue 3 weeks ago

benmwebb commented 3 weeks ago

If the user provides an input file to make_mmcif containing one or more unused objects - i.e. a table row with an ID that is not used anywhere else, such as an ihm_geometric_object_transformation that is not used by any geometric object - we should preserve this on output, as the archive folks rely on this behavior in their pipeline. python-ihm requires that all Python objects are ultimately referenced by the top-level System object. Generally we deal with unused objects by keeping a reference to them in an "orphan" list in the System object. But not all classes have orphan lists, so unused objects of those classes are lost on output.

To find all potentially lost classes, look at every instantiation of the IDMapper class (or its subclasses) in reader.py that has None as the first argument. We should either add an orphan list for each such class, or have some sort of catch-all list (although that would complicate output of those objects by the dumpers). Either way, the list should probably not be part of the API for now, as there is little reason to create such objects outside of the "preserve an existing file" behavior.
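A rough way to list those instantiations is to walk the syntax tree of reader.py and report every call whose name ends in "IDMapper" and whose first argument is the literal None. This is only a sketch: the helper name `find_untracked_mappers` and the sample source lines are made up for illustration (they are not copied from reader.py), and matching subclasses by the "IDMapper" suffix is an assumption about naming.

```python
import ast


def find_untracked_mappers(source):
    """Return sorted (line, mapped class) pairs for *IDMapper(None, ...) calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Handle both IDMapper(...) and module.IDMapper(...)
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", "")
        first = node.args[0]
        if (name.endswith("IDMapper") and isinstance(first, ast.Constant)
                and first.value is None):
            # The second argument is assumed to be the class being mapped
            mapped = ast.unparse(node.args[1]) if len(node.args) > 1 else "?"
            hits.append((node.lineno, mapped))
    return sorted(hits)


# Hypothetical stand-in for the contents of reader.py
sample = '''
tracked = IDMapper(system.things, Thing)
untracked = IDMapper(None, OtherThing)
also_untracked = _SpecialIDMapper(None, pkg.ThirdThing)
'''

hits = find_untracked_mappers(sample)
for lineno, mapped in hits:
    print(lineno, mapped)
```

To run it against the real file, pass `open("reader.py").read()` (path is an assumption) instead of `sample`. The result is still only the superset of candidates; each one needs checking by hand.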

benmwebb commented 3 weeks ago

Note though that the set of IDMapper(None, ...) instantiations is a superset of all lost classes. A number of classes in there aren't lost even without an orphan list. For example, ihm.source.Synthetic objects don't have an orphan list but won't be lost, because the _pdbx_entity_src_syn table has to contain an entity_id. Thus on read, an Entity object is created which keeps a reference to the Synthetic object. Since Entity objects are tracked, we don't need to keep an orphan list for Synthetic. So we should check all such potential-orphan tables to see if they also contain an ID of a tracked object.
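The difference between the two cases can be shown with a toy model. None of the classes below are python-ihm classes; they are simplified stand-ins, and `_orphan_transformations` is a hypothetical name for a private orphan list. A source object survives because a tracked entity points at it, while an unused transformation survives only if something like an orphan list holds it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transformation:
    id: str


@dataclass
class GeometricObject:
    id: str
    transformation: Transformation


@dataclass
class Synthetic:
    details: str


@dataclass
class Entity:
    id: str
    source: Optional[Synthetic] = None


@dataclass
class System:
    entities: List[Entity] = field(default_factory=list)
    geometric_objects: List[GeometricObject] = field(default_factory=list)
    # Hypothetical private list of transformations nothing else refers to
    _orphan_transformations: List[Transformation] = field(default_factory=list)


def transformations_to_dump(system):
    """Collect each transformation reachable from the system exactly once."""
    seen, out = set(), []
    used = [g.transformation for g in system.geometric_objects]
    for t in used + system._orphan_transformations:
        if id(t) not in seen:
            seen.add(id(t))
            out.append(t)
    return out


def sources_to_dump(system):
    """Sources need no orphan list: every one hangs off a tracked entity."""
    return [e.source for e in system.entities if e.source is not None]


system = System()
used, unused = Transformation("1"), Transformation("2")
system.geometric_objects.append(GeometricObject("1", used))
system.entities.append(Entity("1", source=Synthetic("example")))

without_orphans = [t.id for t in transformations_to_dump(system)]
system._orphan_transformations.append(unused)
with_orphans = [t.id for t in transformations_to_dump(system)]
```

Before the unused transformation is added to the orphan list, nothing reachable from the system refers to it, so a dumper walking the system cannot find it.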

benmwebb commented 3 weeks ago

I think with careful construction a user could create an input file that results in the following unused objects:

We should add a test that reads an mmCIF file with each of these tables, writes a new file, and asserts that the new file contains all the same tables.
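Such a test might be sketched as follows. The helpers `cif_categories`, `lost_categories` and `round_trip` are hypothetical names, the regular expression is a simplification (it ignores multi-line text fields and quoting), and the two short mmCIF strings are made up. `round_trip` assumes python-ihm is installed and uses `ihm.reader.read` and `ihm.dumper.write`; it is not called in the example itself.

```python
import io
import re

# Matches the category part of a data name such as "_entity.id"
_CATEGORY = re.compile(r"^[ \t]*(_[A-Za-z0-9_\[\]]+)\.", re.MULTILINE)


def cif_categories(text):
    """Return the set of (lowercased) category names found in mmCIF text."""
    return {name.lower() for name in _CATEGORY.findall(text)}


def lost_categories(before, after):
    """Categories present in the input but absent from the output."""
    return cif_categories(before) - cif_categories(after)


def round_trip(text):
    """Read mmCIF text with python-ihm and write it back out (needs ihm)."""
    import ihm.reader
    import ihm.dumper
    systems = ihm.reader.read(io.StringIO(text))
    out = io.StringIO()
    ihm.dumper.write(out, systems)
    return out.getvalue()


# Made-up input and output, standing in for a real read-write cycle
before = ("data_x\n_entry.id x\nloop_\n"
          "_ihm_geometric_object_transformation.id\n1\n")
after = "data_x\n_entry.id x\n"

lost = lost_categories(before, after)
print(sorted(lost))
```

In a real test, `after` would be `round_trip(before)` and the assertion would be that `lost_categories(before, after)` is empty.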

benmwebb commented 3 weeks ago

It would be very difficult to completely preserve the _struct_ref_seq, _struct_ref_seq_dif and _ihm_entity_poly_segment tables since we kind of rely on Entity objects being available to instantiate ihm.reference.Sequence, ihm.reference.Alignment and ihm.AsymUnitRange objects. But files lacking entity tables are probably unusual. Thus, closing this for now.