Closed aozalevsky closed 1 year ago
Merging #118 (8f06848) into main (104640e) will decrease coverage by 2.76%. The diff coverage is 100.00%.
From a user perspective, it's somewhat unintuitive behavior. mmCIF is a table-based format, so I see a table in the mmCIF file, I see the field names, and I expect to be able to access this information.
For instance, you can currently get asym and atom from xl, but not the residue (neither residue number nor residue name). You have to guess that the residue is linked inside the experimental_crosslink, but the Residue object doesn't have a comp_id attribute. In other words, you have to make multiple hops to get (a) data that typically goes together and is accessed together, and (b) data that is already in the table.
I don't see a reason why one would want to impose a non-redundant scheme (essentially doing manual filtering/separation of the information) if redundancy is part of the format.
Linking to instances is completely OK unless it obscures the information. Again, as in the example with Residue, which doesn't have a comp_id attribute, it essentially removes information that was already accessible and forces you to make several additional hops to get a residue name. The same goes for id and group_id for a specific restraint. The id is useful when, for instance, you want to select a specific restraint from a file. And for some classes ids are preserved:
why not be consistent?
> From a user perspective, it's somewhat unintuitive behavior. mmCIF is a table-based format, so I see a table in the mmCIF file, I see the field names, and I expect to be able to access this information.
That is not how python-ihm is designed. The internal representation is a hierarchy of Python objects, not a bunch of tables.
> For instance, you can currently get asym and atom from xl, but not the residue (neither residue number nor residue name). You have to guess that the residue is linked inside the experimental_crosslink, but the Residue object doesn't have a comp_id attribute. In other words, you have to make multiple hops to get (a) data that typically goes together and is accessed together, and (b) data that is already in the table.
No guessing is required. We can certainly add convenience properties where necessary to reduce the number of hops though.
> Linking to instances is completely OK unless it obscures the information. Again, as in the example with Residue, which doesn't have a comp_id attribute
Nothing has a comp_id attribute. The only place this information is stored is ChemComp.id. Everything else must reference a ChemComp by design. In your example it would be trivial to add a comp property to Residue if that's what you need. Then you can just say r.comp.id, no duplication needed.
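To illustrate the reference-by-design idea, here is a minimal sketch using toy stand-in classes. The class bodies below are hypothetical and written only for this example; they are not python-ihm's actual implementation, and the only names taken from the discussion are ChemComp, Residue, comp, id and seq_id. The point is that the component name is stored once, and a property resolves it on demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChemComp:
    """Toy stand-in for a chemical component; `id` is the only place the name is stored."""
    id: str


class Entity:
    """Toy stand-in (assumption): an ordered sequence of ChemComp objects."""

    def __init__(self, sequence):
        self.sequence = tuple(sequence)


class Residue:
    """Toy stand-in: a 1-based position (seq_id) within an entity."""

    def __init__(self, entity, seq_id):
        self.entity = entity
        self.seq_id = seq_id

    @property
    def comp(self):
        # Resolve the component by reference instead of storing a copy of its name.
        return self.entity.sequence[self.seq_id - 1]


lys = ChemComp("LYS")
ala = ChemComp("ALA")
entity = Entity([ala, lys, ala])

r = Residue(entity, 2)
print(r.comp.id)
```

With this layout the residue name is reachable in one expression (r.comp.id), and renaming or replacing a component only has to happen in one place.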
I've added convenience accessors so this information should be available for a given ResidueCrossLink object xl:
xl._id
xl.experimental_cross_link._id
xl.residue1.seq_id
xl.residue2.seq_id
xl.residue1.comp.id
xl.residue2.comp.id
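If you want the table-like view back, a small helper can flatten these accessors into one row per cross-link. This is only a sketch: describe_cross_link and fake_residue are hypothetical names invented here, and the xl object below is a hand-built stand-in with made-up values rather than something read from a real deposition. The helper relies only on the attribute names listed above.

```python
from types import SimpleNamespace


def describe_cross_link(xl):
    """Flatten the accessors listed above into a single dict (one 'table row')."""
    return {
        "id": xl._id,
        "experimental_id": xl.experimental_cross_link._id,
        "seq_id1": xl.residue1.seq_id,
        "comp_id1": xl.residue1.comp.id,
        "seq_id2": xl.residue2.seq_id,
        "comp_id2": xl.residue2.comp.id,
    }


def fake_residue(seq_id, comp_id):
    # Hypothetical stand-in exposing only .seq_id and .comp.id
    return SimpleNamespace(seq_id=seq_id, comp=SimpleNamespace(id=comp_id))


# Made-up example values, standing in for a ResidueCrossLink object
xl = SimpleNamespace(
    _id="1",
    experimental_cross_link=SimpleNamespace(_id="7"),
    residue1=fake_residue(12, "LYS"),
    residue2=fake_residue(48, "LYS"),
)

row = describe_cross_link(xl)
print(row)
```

Because the helper only reads attributes, it should work unchanged on any object that exposes the same accessors.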
This should work for the majority of depositions. If memory serves there are one or two where they elected to enforce cross-links on different residues from those identified experimentally (these are easy to see because the comp_ids in the mmCIF file are not all LYS, for example). python-ihm doesn't currently handle that; see #119.
(If your intention is to preserve data exactly as read from the mmCIF file, python-ihm probably isn't the best tool for the job because it is not designed to do that. Although you can use its low-level classes if you want to read the file as a bunch of tables, there are other tools such as Biopython which can do that too.)
Some mandatory attributes are missing from the interface in the current version: