ihmwg / python-modelcif

Python package for handling ModelCIF mmCIF and BinaryCIF files
MIT License
10 stars 1 forks source link

not_modeled_residue_ranges in python-ihm #37

Closed gtauriello closed 2 months ago

gtauriello commented 2 months ago

@benmwebb I ran into a new issue as I updated with the latest python-ihm (related to https://github.com/ihmwg/python-ihm/issues/150)...

There is now a required not_modeled_residue_ranges attribute for models which is not set by default for modelcif.model.Model and its subclasses (which makes it fail in the dumper afterwards).

Basically I was expected it to "just work" without any changes on my end: i.e. I didn't think I would need to add not_modeled_residue_ranges to my model classes and I expected to have default behaviour which checks which residues were covered by coordinates and fills that table automatically (e.g. as I think is done in _get_not_modeled_residues in ihm/util/make_mmcif.py?). For the current model set I am working on, I will work around it of course, but just wondering if this can be handled automatically...

Also @brindakv: I was not aware of _pdbx_poly_seq_scheme needing that behaviour for missing residues. We had such cases in ModelArchive so far and will need to remediate them (e.g. ma-fesnov-nov003nm with "UNK" in entity sequence which are not covered by coordinates). Are there any other fields which need to be considered when residues are missing?

gtauriello commented 2 months ago

Forgot to attach my example code: test_model.py.zip

So basically I would have expected this to work without me having to add not_modeled_residue_ranges in MyModel and that it would be auto-filled based on what residues actually have coordinates...

benmwebb commented 2 months ago

This is surely a bug, as the not-modeled-residues table is for IHMCIF, not ModelCIF. It's not supposed to be usable (or required) here. I will fix.

gtauriello commented 2 months ago

I do not understand why this is not supposed to be usable or required in ModelCIF. The "pdbx_poly_seq_scheme" category is populated here as well and is not IHM specific. So if correct mmCIF files are expected to have '?' in "pdbx_poly_seq_scheme" for non-resolved residues, we need to have this in ModelCIF as well. Or am I missing something?

benmwebb commented 2 months ago

I do not understand why this is not supposed to be usable or required in ModelCIF.

The _ihm_residues_not_modeled table is defined in IHMCIF, not ModelCIF, so there's nowhere to write NotModeledResidueRange objects to in the mmCIF file.

The "pdbx_poly_seq_scheme" category is populated here as well and is not IHM specific. So if correct mmCIF files are expected to have '?' in "pdbx_poly_seq_scheme" for non-resolved residues, we need to have this in ModelCIF as well. Or am I missing something?

In principle you could create NotModeledResidueRange objects in python-modelcif and add them to your model, just like you can do in python-ihm, and they should still result in the behavior @brindakv is asking for in pdbx_poly_seq_scheme. But this information will be lost if you load the mmCIF file back in.

gtauriello commented 2 months ago

Ok. I see the point on _ihm_residues_not_modeled with the extra information but I am still not sure why they need to be listed explicitly to correctly fill pdbx_poly_seq_scheme. I mean in ModelCIF (and mmCIF) unresolved residues are simply defined by not being covered in atom_site, right? I see that in IHM there are other coarse-graining categories which can be used additionally but that's a different story...

Here my point is mainly that assuming pdbx_poly_seq_scheme needs '?' for unresolved residues in valid mmCIF files, we currently do not produce valid mmCIF files in python-modelcif if there are unresolved residues. And if that information is automatically generated based on entity sequence and atom_site, there should also not be any issues when loading and rewriting mmCIF files...

benmwebb commented 2 months ago

Ok. I see the point on _ihm_residues_not_modeled with the extra information but I am still not sure why they need to be listed explicitly to correctly fill pdbx_poly_seq_scheme. I mean in ModelCIF (and mmCIF) unresolved residues are simply defined by not being covered in atom_site, right?

As I mentioned at ihmwg/python-ihm#150, python-ihm (and thus python-modelcif) outputs pdbx_poly_seq_scheme before it has seen any atom_site information, so can't fill in any gaps this way.

And if that information is automatically generated based on entity sequence and atom_site, there should also not be any issues when loading and rewriting mmCIF files...

It is not "automatic"; there is specific logic in the "prepare an IHMCIF file for deposition" script, make_mmcif, at https://github.com/ihmwg/python-ihm/blob/296ce06af/ihm/util/make_mmcif.py#L64-L95

gtauriello commented 2 months ago

Given Brinda's answer by email, I would suggest that we reopen this issue.

The main problem is that we need a way in python-modelcif to correctly populate pdbx_poly_seq_scheme in the (very much existing) cases where models do not contain coordinates for every residue in the entity.

I don't know what a good technical solution to this problem is but some thoughts on that:

benmwebb commented 2 months ago

This is an already-solved problem in python-ihm, so we can just use the same solution in python-modelcif. I have a prototype already and I will commit this later today after testing. Essentially the make_mmcif script will

It's up to you whether you run your files through make_mmcif or it is done by the archive folks. For IHM the equivalent script is run as part of the deposition pipeline.