Add case- and whitespace-preserving tokenizers

benmwebb commented 1 month ago

It would be useful to be able to carry out simple search-and-replace style operations on existing mmCIF files, e.g. "change all citation_id 1 to primary". This can currently be done by reading in the file, making the change, and then writing it out, but this results in a large diff: categories may end up in a different order, IDs may be reassigned, data items may be added or reordered, and any formatting such as whitespace or comments is lost. The mmCIF reader currently breaks the file into tokens and drops whitespace. Consider storing the original whitespace and capitalization in the tokens, so that we can potentially read in a file and write out one that exactly matches it.
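To illustrate the idea, here is a minimal sketch of a lossless tokenizer. It is not the python-ihm implementation: the names `tokenize` and `replace_value` are hypothetical, and the sketch ignores quoted strings, comments, multi-line text fields, and loop_ tables. It only shows that if whitespace is kept as tokens, joining the tokens reproduces the input exactly, and a single value can be changed without touching anything else.

```python
import re

# Toy tokenizer: alternate runs of whitespace and non-whitespace.
# Real mmCIF also has quoted strings, comments, and multi-line text fields,
# which this sketch deliberately ignores.
_TOKEN_RE = re.compile(r'\s+|\S+')


def tokenize(text):
    """Split text into tokens without discarding any characters."""
    return _TOKEN_RE.findall(text)


def replace_value(tokens, item, old, new):
    """Change the value following data item `item` from `old` to `new`.

    Only handles the simple "name value" form, not loop_ tables.
    """
    out = list(tokens)
    for i, tok in enumerate(out):
        # Expected layout is: name token, whitespace token, value token.
        # Data item names are compared case-insensitively.
        if (tok.lower() == item.lower() and i + 2 < len(out)
                and out[i + 2] == old):
            out[i + 2] = new
    return out


original = "_citation.id      1\n_citation.title   Example\n"
tokens = tokenize(original)
# Joining the untouched tokens reproduces the input exactly
assert "".join(tokens) == original

edited = "".join(replace_value(tokens, "_citation.id", "1", "primary"))
print(edited)
```

Only the value token changes here; the run of spaces before it is written back as read, so a diff against the original would show a single modified line.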

benmwebb commented 1 month ago

This should be largely addressed by 0390c2a15bec061f775a7c4fda1ffdc54812a4a6.

benmwebb commented 1 month ago

@brindakv Here is a simple Python script to change citation 1 to "primary" using these new code paths. All the classes are undocumented for now, but I will clean them up a bit and move them to a new documented module, maybe ihm.token.

Note that although the tokenizer understands mmCIF syntax and preserves case, whitespace, and comments, it does not use the dictionary, so we must be careful to catch every table that references citation.id. I would therefore recommend running the output through a validator after such low-level hacking.

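import ihm.format

# Change citation ID '1' to 'primary' in the citation table itself and in
# the data items that refer to it. The targets with a leading '.' name no
# category, so they presumably apply to that keyword in any category.
filters = [
    ihm.format._ChangeValueFilter(
        '_citation.id', old='1', new='primary'),
    ihm.format._ChangeValueFilter(
        '.citation_id', old='1', new='primary'),
    ihm.format._ChangeValueFilter(
        '.fitting_method_citation_id', old='1', new='primary')]

# latin1 can decode any byte, so reading never fails on non-ASCII content
with open('PDBDEV_00000030.cif', encoding='latin1') as fh_in:
    r = ihm.format._PreservingCifReader(fh_in)
    with open('out.cif', 'w', encoding='latin1') as fh_out:
        # Write each token back out as mmCIF text, filtered or not
        for t in r.read_file(filters=filters):
            fh_out.write(t.as_mmcif())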
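Since the tokenizer does not consult the dictionary, it can help to list the data items in a file that look like citation references before writing the filters. The following is a rough heuristic sketch, not part of python-ihm: the function name `citation_items` is hypothetical, the sample text is made up for illustration, and the scan only looks at names at the start of a line, so quoted or multi-line values could confuse it. The dictionary and a validator remain the authoritative check.

```python
import re

# A data item name is taken to be a token of the form _category.keyword at
# the start of a line. This does not understand quoting or multi-line text
# fields, so treat the result as a list of candidates only.
_ITEM_RE = re.compile(r'^[ \t]*(_[^\s.]+\.\S+)', re.MULTILINE)


def citation_items(text):
    """Return the sorted data item names that look like citation IDs."""
    found = set()
    for name in _ITEM_RE.findall(text):
        name = name.lower()
        keyword = name.split('.', 1)[1]
        # Matches _citation.id plus keywords such as citation_id and
        # fitting_method_citation_id
        if name == '_citation.id' or keyword.endswith('citation_id'):
            found.add(name)
    return sorted(found)


# Made-up sample, for illustration only
sample = """\
loop_
_citation.id
_citation.title
1 'Example title'
#
loop_
_citation_author.citation_id
_citation_author.name
1 'Smith, J.'
#
_struct.title Example
"""
print(citation_items(sample))
```

Each name reported can then be checked against the filter list; any that is not covered by a filter target is a table the replacement would miss.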

brindakv commented 1 month ago

Thank you @benmwebb.

ihmwg / python-ihm

Add case- and whitespace-preserving tokenizers #141