ihmwg / python-modelcif

Python package for handling ModelCIF mmCIF and BinaryCIF files
MIT License
10 stars 1 forks source link

Add struct_ref support #34

Closed gtauriello closed 6 months ago

gtauriello commented 6 months ago

Following some discussions with Brinda, I wanted to generate ModelCIF files which additionally to ma_target_ref_db_details also fill the struct_ref category. I figured, I could just do that by adding ihm.reference.Reference objects to the references list in modelcif.Entity (which is identical to ihm.Entity), but that didn't work.

I think that for this to work, one would only need to include _StructRefDumper to the modelcif.dumper and restrict _StructRefDumper to only process subclasses of ihm.reference.Reference and _TargetRefDBDumper to only process subclasses of modelcif.reference.TargetReference. So this should be a fairly minor change which would enable to easily add both types of references.

Do you think this would work?

benmwebb commented 6 months ago

Funny, when I first saw the ma dictionary one of my questions was "why do we have ma_target_ref_db_details when we already have struct_ref"? (I forget @brindakv's answer at the time.)

I think your suggestion could work... but I don't like it very much! References in python-ihm and python-modelcif are not designed to be mixed and matched. You would have to have two sets of objects that contain very similar data, since there's a lot of overlap between the two tables (e.g. UniProt accession). Don't you think we should just have one such object, and use it to populate both tables? This would necessitate a bit of work in python-ihm too, since there are some fields in ma_target_ref_db_details that aren't in struct_ref, like seq_db_sequence_version_date (but perhaps there's another PDBx table where that could go?) Or perhaps we could get rid of ma_target_ref_db_details entirely?

gtauriello commented 6 months ago

Ah well. As Brinda had pointed out, this was historically due to wanting a symmetry between ma_template_ref_db_details and ma_target_ref_db_details (not super important by now given the popularity on de novo models). We did consider getting rid of ma_target_ref_db_details entirely (or deprecating it) but decided against it for practical reasons: 1) it's a breaking change which would need to be applied to all ModelCIF files and 2) ma_target_ref_db_details has 2 useful features missing in struct_ref: seq. version & checksum and a controlled vocabulary for db_name. We could add extra ma_... items to struct_ref in ModelCIF but we wouldn't be able to modify db_name and would need to rely on "conventions" instead of a proper definition of the possible DBs.

So in the end we decided to keep ma_target_ref_db_details and add struct_ref in ModelCIF files to be compatible with PDB and PDB-Dev. I'll forward you the email thread we had with Brinda about it...

Btw: currently ma_target_ref_db_details also doesn't allow to have multiple entries pointing to the same DB which is a bad restriction. That will be fixed with https://github.com/ihmwg/ModelCIF/issues/18.

In terms of implementation, I am also happy to have it all in one object instead of duplicating things. Could modelcif.reference.TargetReference just inherit from ihm.reference.Sequence? It could fill everything from ihm.reference.Sequence with sequence set to None (which seems acceptable if I read the _StructRefDumper code correctly). Then if one wants to fill the rest of ihm.reference.Sequence, one can do that after construction (which is anyway needed to fill the alignments list).

benmwebb commented 6 months ago

In terms of implementation, I am also happy to have it all in one object instead of duplicating things. Could modelcif.reference.TargetReference just inherit from ihm.reference.Sequence? It could fill everything from ihm.reference.Sequence with sequence set to None (which seems acceptable if I read the _StructRefDumper code correctly). Then if one wants to fill the rest of ihm.reference.Sequence, one can do that after construction (which is anyway needed to fill the alignments list).

Yes, I think that would work. I'll take a look.