ihmwg / python-modelcif

Python package for handling ModelCIF mmCIF and BinaryCIF files
MIT License

Issues in validation and make-mmcif #30

Closed gtauriello closed 1 year ago

gtauriello commented 1 year ago

I did some more tests combining `make-mmcif.py` and `validate_mmcif.py` and observed the following issues:

benmwebb commented 1 year ago

> When testing on `1crn.cif`, I observe that the validation gives me errors such as: `The following keywords are not defined in the dictionary: _entity_src_gen.entity_id, ...`

python-ihm's dictionary parser expected data items to be defined after their parent category, but that's not the case for `entity_src_gen`. See ihmwg/python-ihm#107.
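The ordering assumption can be relaxed by buffering items that appear before their category and attaching them once the category definition turns up. A minimal sketch of that idea, not python-ihm's actual implementation; `link_items` and its tuple-based input format are invented here for illustration:

```python
from collections import defaultdict


def link_items(definitions):
    """Link dictionary item definitions to their parent categories.

    ``definitions`` is a list of ('category', name) or
    ('item', category_name, item_name) tuples, in file order.
    Items seen before their category are buffered and attached later,
    so definition order no longer matters.
    """
    categories = {}
    pending = defaultdict(list)
    for d in definitions:
        if d[0] == 'category':
            cat = {'name': d[1], 'items': []}
            categories[d[1]] = cat
            # Attach any items that were defined before this category
            cat['items'].extend(pending.pop(d[1], []))
        else:
            _, catname, item = d
            if catname in categories:
                categories[catname]['items'].append(item)
            else:
                pending[catname].append(item)  # category not seen yet
    return categories
```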

> An exception in make-mmcif for the version with ligands (`model_01_fix_maxit.cif`): `AttributeError: 'NoneType' object has no attribute 'lower'`

This is because that file has `pdbx_entity_nonpoly`, and python-modelcif expects it to contain a mandatory `ma_model_mode` data item. But we can handle that case. See #31.
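The failure mode is the usual one for optional mmCIF values: a data item that may legitimately be absent comes back as `None`, and calling `.lower()` on it raises `AttributeError`. A None-safe comparison avoids that; `is_yes` is a hypothetical helper for illustration, not python-modelcif's actual code:

```python
def is_yes(flag):
    """None-safe check for a case-insensitive 'yes' flag.

    Unguarded code would do ``flag.lower() == 'yes'`` and crash with
    ``AttributeError: 'NoneType' object has no attribute 'lower'``
    whenever the data item is missing from the input file.
    """
    return flag is not None and flag.lower() == 'yes'
```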

> An exception in validate_mmcif (after running make-mmcif) for the version without ligands (`model_01_fix_nolig_maxit.cif`): `Mandatory keyword ma_data.name cannot have value '?'`

I'd argue this is not a bug. `ma_data.name` is indeed mandatory in ModelCIF, and python-modelcif makes a best effort to fill it in; in this case it uses `entity.pdbx_description`. But that field is `?` in your input file, so it can't manufacture data. It could replace it with `.` or something generic like "target sequence" to make the validator happy, but really what the validator is saying here is "hey user, you need to put in a name here".
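The best-effort fill described above can be sketched as follows; `best_effort_name` is a hypothetical helper that mimics the fallback behavior, treating the mmCIF unknown (`?`) and inapplicable (`.`) markers as missing rather than propagating them:

```python
def best_effort_name(pdbx_description):
    """Derive a ma_data.name candidate from entity.pdbx_description.

    mmCIF uses '?' for unknown and '.' for inapplicable; neither is a
    usable name, so return None and let the validator flag the gap for
    the user instead of emitting a placeholder.
    """
    if pdbx_description in (None, '?', '.'):
        return None
    return pdbx_description
```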

gtauriello commented 1 year ago

Thanks very much for the fixes.

On the last one: I see your point, but then we need to communicate clearly that we cannot guarantee that an mmCIF-compliant file (like the MAXIT-generated one used in this example) can be converted to ModelCIF-compliant output with make-mmcif, e.g. via a remark in that file or a warning emitted with the output. I think the example used here (a plain PDB file with no extra information on the entities) is fairly representative of what we commonly see in the structure prediction field, so some guidelines on how to fix that would be helpful.
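The warning suggested above could be as simple as scanning the converted output for mandatory ModelCIF items still left at the mmCIF unknown marker `?` and telling the user what to fill in. A rough standalone sketch, with a hypothetical `MANDATORY` list and a naive line-based scan rather than a real CIF parser:

```python
# Hypothetical list of mandatory ModelCIF data items to check; a real
# implementation would derive this from the ModelCIF dictionary.
MANDATORY = ['_ma_data.name']


def warn_unknown(cif_text):
    """Return warnings for mandatory items whose value is the '?' marker.

    Only handles simple ``_item value`` lines (no loop_ constructs);
    this is an illustration of the suggested warning, not a CIF parser.
    """
    warnings = []
    for line in cif_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in MANDATORY and parts[1] == '?':
            warnings.append(f"{parts[0]} is mandatory but unset; "
                            "please provide a value in your input")
    return warnings
```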