ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License
14 stars 7 forks source link

Add missing crosslink attributes #118

Closed aozalevsky closed 1 year ago

aozalevsky commented 1 year ago

Some mandatory attributes are missing from the interface in the current version:

id
group_id
seq_id_1
seq_id_2
comp_id_1
comp_id_2
codecov[bot] commented 1 year ago

Codecov Report

Merging #118 (8f06848) into main (104640e) will decrease coverage by 2.76%. The diff coverage is 100.00%.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118/graphs/tree.svg?width=650&height=150&src=pr&token=J0QoOLwwXv&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg)](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg) ```diff @@ Coverage Diff @@ ## main #118 +/- ## ========================================== - Coverage 99.76% 97.01% -2.76% ========================================== Files 29 28 -1 Lines 7296 6728 -568 Branches 1749 1121 -628 ========================================== - Hits 7279 6527 -752 - Misses 11 180 +169 - Partials 6 21 +15 ``` | [Impacted Files](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg) | Coverage Δ | | |---|---|---| | [ihm/reader.py](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg#diff-aWhtL3JlYWRlci5weQ==) | `99.47% <100.00%> (-0.35%)` | :arrow_down: | ... and [5 files with indirect coverage changes](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg) ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg). Last update [104640e...8f06848](https://app.codecov.io/gh/ihmwg/python-ihm/pull/118?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ihmwg).
aozalevsky commented 1 year ago

From a user perspective, it's a somewhat unintuitive behavior. mmCIF is a table-based format, so i see a table in the mmcif file, i see the field names and expect to be able to access this information.

For instance, now you can get asym and atom from xl, but not the residue (neither residue number nor residue name). You have to guess, that residue is linked inside the experimental_crosslink, but the Residue object doesn't have a comp_id attribute. In other words, you have to make multiple hops to get a) the data, that typically goes and accessed together b) the data that is already in the table.

I don't see a reason why one would want to impose a non-redundant scheme (doing essentially manual filtering/separation of the information ) if redundancy is a part of the format.

Linking to instances is completely ok unless it obscures the information. Again, as in the example with Residue, which doesn't have a comp_id attribute, it essentially removes the information that was already accessible and forces you to do several additional hops to get a residue name. The same goes to id and group_id for a specific restraint. The id is useful when, for instance, you want to select a specific restraint from a file. And for some classes ids are preserved:

https://github.com/ihmwg/python-ihm/blob/104640eb7ddd96d99099c10fb0b2e43b773cd16f/ihm/reader.py#L1243-L1244C13

why not be consistent?

benmwebb commented 1 year ago

From a user perspective, it's a somewhat unintuitive behavior. mmCIF is a table-based format, so i see a table in the mmcif file, i see the field names and expect to be able to access this information.

That is not how python-ihm is designed. The internal representation is a hierarchy of Python objects, not a bunch of tables.

For instance, now you can get asym and atom from xl, but not the residue (neither residue number nor residue name). You have to guess, that residue is linked inside the experimental_crosslink, but the Residue object doesn't have a comp_id attribute. In other words, you have to make multiple hops to get a) the data, that typically goes and accessed together b) the data that is already in the table.

No guessing is required. We can certainly add convenience properties where necessary to reduce the number of hops though.

Linking to instances is completely ok unless it obscures the information. Again, as in the example with Residue, which doesn't have a comp_id attribute

Nothing has a comp_id attribute. The only place this information is stored is ChemComp.id. Everything else must reference a ChemComp by design. In your example it would be trivial to add a comp property to Residue if that's what you need. Then you can just say r.comp.id, no duplication needed.

benmwebb commented 1 year ago

I've added convenience accessors so this information should be available for a given ResidueCrossLink object xl:

This should work for the majority of depositions. If memory serves there are one or two where they elected to enforce cross-links on different residues from those identified experimentally (these are easy to see because the comp_ids in the mmCIF file are not all LYS, for example). python-ihm doesn't currently handle that; see #119.

(If your intention is to preserve data exactly as read from the mmCIF file, python-ihm probably isn't the best tool for the job because it is not designed to do that. Although you can use its low-level classes if you want to read the file as a bunch of tables, there are other tools such as Biopython which can do that too.)