ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Dataset ordering #114

Open aozalevsky opened 1 year ago

aozalevsky commented 1 year ago

A branched topology of datasets (see below) breaks the logic dataset order in the mmCIF file.

With a topology like this

A00 (primary)
|
A10

     B00 (primary)
    /    \ 
B10       B11
(parent) 
|
B20

the Dataset table looks like this:

B00
B11
A00
A10
B10
B20

But if I delete the parallel node B11, everything is ordered more reasonably:

A00
A10
B00
B10
B20

I'm adding B11 to the protocol as po.system.orphan_datasets.append(B11). Am I missing something, or is this a bug?

benmwebb commented 1 year ago

I'm not sure what you mean by "logic dataset order" but the IHM dictionary doesn't mandate ordering for any table IIRC. python-ihm generally will output objects in a consistent but unsorted order. In the case of datasets they will be output in the same order they're encountered in the Python object hierarchy. This is not a bug unless the output dataset IDs are actually wrong.

BTW, generally it should not be necessary to place objects in the various orphan_ lists. python-ihm stores objects in a hierarchy, so generally something like a Dataset should be referenced by another object in that hierarchy (such as a Restraint or another Dataset). But on reading an mmCIF file it's possible that an object such as a dataset is listed in a table but nothing refers to it. The orphan_ lists are provided to keep references to such objects.
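To illustrate "output in the order encountered", here is a toy model in plain Python. It is not python-ihm's code: the `Dataset` class and `output_order` function are hypothetical stand-ins, and the sketch assumes two things the thread does not confirm, namely that entries in an orphan list are visited before the rest of the hierarchy and that a dataset's parents are emitted before the dataset itself. Under those assumptions it happens to reproduce both orderings reported above.

```python
class Dataset:
    """Hypothetical stand-in for a dataset node (not the python-ihm class)."""
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def output_order(entry_points):
    """Return dataset names in first-encounter order, parents before children.

    entry_points is the sequence of datasets reached directly by the
    traversal (assumed here: orphans first, then datasets referenced
    elsewhere in the hierarchy).
    """
    seen = set()
    order = []

    def visit(ds):
        if id(ds) in seen:
            return
        seen.add(id(ds))
        for parent in ds.parents:
            visit(parent)      # emit parents first
        order.append(ds.name)

    for ds in entry_points:
        visit(ds)
    return order


# Topology from the report above
A00 = Dataset("A00")
A10 = Dataset("A10", [A00])
B00 = Dataset("B00")
B10 = Dataset("B10", [B00])
B11 = Dataset("B11", [B00])
B20 = Dataset("B20", [B10])

# B11 sits in the (assumed) orphan list, so it is reached first
with_orphan = output_order([B11, A10, B20])
# Without B11, only the leaf datasets are reached directly
without_orphan = output_order([A10, B20])

print(with_orphan)     # ['B00', 'B11', 'A00', 'A10', 'B10', 'B20']
print(without_orphan)  # ['A00', 'A10', 'B00', 'B10', 'B20']
```

In this model B00 and B11 come first only because B11 is reached before anything else, which would be consistent with the advice not to place a dataset in an orphan list when another object already refers to it.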

aozalevsky commented 1 year ago

My first impression was that it should, just as you said, traverse the hierarchy (something like a depth-first search starting from the primary datasets). But it looks like the object hierarchy is more complicated. For instance, in the example above, the objects were created in the following order: B20, B00, B10, B11. Yet B00 and B11 end up at the top of the list, while B10 and B20 are at the bottom.
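For comparison, the depth-first order described here (starting from the primary datasets) can be sketched as below. This is only an illustration on plain dataset names: `depth_first_order` and the `parents_of` mapping are hypothetical, and nothing here uses or changes how python-ihm assigns IDs. Since the dictionary does not mandate row order, something like this would only matter for display.

```python
from collections import defaultdict


def depth_first_order(parents_of, primaries):
    """Order dataset names depth-first, starting from the primary datasets.

    parents_of maps each dataset name to the list of its parents' names.
    """
    # Invert the parent links to get each dataset's children
    children = defaultdict(list)
    for child, parents in parents_of.items():
        for parent in parents:
            children[parent].append(child)

    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        order.append(name)
        for child in children[name]:
            visit(child)

    for name in primaries:
        visit(name)
    return order


# Topology from the report above
parents_of = {
    "A00": [],
    "A10": ["A00"],
    "B00": [],
    "B10": ["B00"],
    "B11": ["B00"],
    "B20": ["B10"],
}

dfs_order = depth_first_order(parents_of, ["A00", "B00"])
print(dfs_order)  # ['A00', 'A10', 'B00', 'B10', 'B20', 'B11']
```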