Closed gtauriello closed 2 months ago
Forgot to attach my example code: test_model.py.zip
So basically I would have expected this to work without me having to add not_modeled_residue_ranges
in MyModel
and that it would be auto-filled based on what residues actually have coordinates...
This is surely a bug, as the not-modeled-residues table is for IHMCIF, not ModelCIF. It's not supposed to be usable (or required) here. I will fix.
I do not understand why this is not supposed to be usable or required in ModelCIF. The "pdbx_poly_seq_scheme" category is populated here as well and is not IHM specific. So if correct mmCIF files are expected to have '?' in "pdbx_poly_seq_scheme" for non-resolved residues, we need to have this in ModelCIF as well. Or am I missing something?
I do not understand why this is not supposed to be usable or required in ModelCIF.
The _ihm_residues_not_modeled
table is defined in IHMCIF, not ModelCIF, so there's nowhere to write NotModeledResidueRange
objects to in the mmCIF file.
The "pdbx_poly_seq_scheme" category is populated here as well and is not IHM specific. So if correct mmCIF files are expected to have '?' in "pdbx_poly_seq_scheme" for non-resolved residues, we need to have this in ModelCIF as well. Or am I missing something?
In principle you could create NotModeledResidueRange
objects in python-modelcif and add them to your model, just like you can do in python-ihm, and they should still result in the behavior @brindakv is asking for in pdbx_poly_seq_scheme
. But this information will be lost if you load the mmCIF file back in.
Ok. I see the point on _ihm_residues_not_modeled with the extra information but I am still not sure why they need to be listed explicitly to correctly fill pdbx_poly_seq_scheme. I mean in ModelCIF (and mmCIF) unresolved residues are simply defined by not being covered in atom_site, right? I see that in IHM there are other coarse-graining categories which can be used additionally but that's a different story...
Here my point is mainly that assuming pdbx_poly_seq_scheme needs '?' for unresolved residues in valid mmCIF files, we currently do not produce valid mmCIF files in python-modelcif if there are unresolved residues. And if that information is automatically generated based on entity sequence and atom_site, there should also not be any issues when loading and rewriting mmCIF files...
Ok. I see the point on _ihm_residues_not_modeled with the extra information but I am still not sure why they need to be listed explicitly to correctly fill pdbx_poly_seq_scheme. I mean in ModelCIF (and mmCIF) unresolved residues are simply defined by not being covered in atom_site, right?
As I mentioned at ihmwg/python-ihm#150, python-ihm (and thus python-modelcif) outputs pdbx_poly_seq_scheme
before it has seen any atom_site
information, so can't fill in any gaps this way.
And if that information is automatically generated based on entity sequence and atom_site, there should also not be any issues when loading and rewriting mmCIF files...
It is not "automatic"; there is specific logic in the "prepare an IHMCIF file for deposition" script, make_mmcif
, at https://github.com/ihmwg/python-ihm/blob/296ce06af/ihm/util/make_mmcif.py#L64-L95
Given Brinda's answer by email, I would suggest that we reopen this issue.
The main problem is that we need a way in python-modelcif to correctly populate pdbx_poly_seq_scheme in the (very much existing) cases where models do not contain coordinates for every residue in the entity.
I don't know what a good technical solution to this problem is but some thoughts on that:
not_modeled_residue_ranges
. We can introduce a modelcif.model.NotModeledResidueRange
class for it which is identical to ihm.model.NotModeledResidueRange
but without the reason
attribute since that one is not supported in ModelCIF.pdbx_poly_seq_scheme
and atom_site
, I suppose one could produce a warning/error if the explicitly defined not_modeled_residue_ranges
doesn't match what we just wrote into the file?not_modeled_residue_ranges
. I fail to see how the dumper should not be able to loop over all models, check what residues are covered in the model and use the diff to the entity sequence to set not_modeled_residue_ranges
. If the only worry for it is about having to run Model.get_atoms
twice to get the job done and that this would produce a meaningful overhead in runtime, I would suggest a logic where setting not_modeled_residue_ranges
explicitly (i.e. not None
) would skip the automatic processing of all atoms.not_modeled_residue_ranges
while reading should not be a problem I think.This is an already-solved problem in python-ihm, so we can just use the same solution in python-modelcif. I have a prototype already and I will commit this later today after testing. Essentially the make_mmcif
script will
atom_site
NotModeledResidueRange
objects for each such rangepdbx_poly_seq_scheme
It's up to you whether you run your files through make_mmcif
or it is done by the archive folks. For IHM the equivalent script is run as part of the deposition pipeline.
@benmwebb I ran into a new issue as I updated with the latest python-ihm (related to https://github.com/ihmwg/python-ihm/issues/150)...
There is now a required
not_modeled_residue_ranges
attribute for models which is not set by default formodelcif.model.Model
and its subclasses (which makes it fail in the dumper afterwards).Basically I was expected it to "just work" without any changes on my end: i.e. I didn't think I would need to add
not_modeled_residue_ranges
to my model classes and I expected to have default behaviour which checks which residues were covered by coordinates and fills that table automatically (e.g. as I think is done in _get_not_modeled_residues in ihm/util/make_mmcif.py?). For the current model set I am working on, I will work around it of course, but just wondering if this can be handled automatically...Also @brindakv: I was not aware of _pdbx_poly_seq_scheme needing that behaviour for missing residues. We had such cases in ModelArchive so far and will need to remediate them (e.g. ma-fesnov-nov003nm with "UNK" in entity sequence which are not covered by coordinates). Are there any other fields which need to be considered when residues are missing?