ihmwg / python-modelcif

Python package for handling ModelCIF mmCIF and BinaryCIF files
MIT License
10 stars 1 forks source link

Allow lists for _ma_software_parameter #19

Closed bienchen closed 2 years ago

bienchen commented 2 years ago

Some modelling pipelines have lists as parameter values, like ColabFold's model_order. In _ma_software_parameter, I think that should go to _ma_software_parameter.data_type other and then something like _ma_software_parameter.data_type_other_details list of <integer|string|boolean|float>. This could be automatised, I almost got there myself but the translation of boolean values from True to YES failes, which seems to be at the core of python-ihm ... here is a code example (modelcif/dumper.py):

    def dump_parameters(self, system, writer):
        parameter_id = itertools.count(1)
        type_map = {int: "integer", float: "float", str: "string",
                    bool: "boolean", list: "other"}
        with writer.loop(
                "_ma_software_parameter",
                ["parameter_id", "group_id", "data_type",
                 "data_type_other_details", "name", "value",
                 "description"]) as lp:
            for g in self._param_groups:
                group_id = self._param_group_id[id(g)]
                for p in g:
                    if isinstance(p.value, list):
                        if len(p.value) > 0:
                            data_type_other_details = "list of " + \
                                       f"{type_map.get(type(p.value[0]), str)}"
                        else:
                            data_type_other_details = "list"
                    else:
                        data_type_other_details = None

                    lp.write(parameter_id=next(parameter_id),
                             group_id=group_id,
                             data_type=type_map.get(type(p.value), str),
                             name=p.name, value=p.value,
                             description=p.description,
                             data_type_other_details=data_type_other_details)

Could that functionality be added or is it a better idea to let developers fill _ma_software_parameter.data_type_other_details themselves?

benmwebb commented 2 years ago

Of course mmCIF lets us dump any old string data so we could certainly handle this in python-modelcif. Essentially it will just use Python's string representation of a list here, so your code looks OK to me (except for the f-string; that won't work in Python 2). I'd be reluctant to establish a precedent for doing that though since the format of these lists isn't defined in the dictionary. I think it would be better to either add support in the dictionary itself for such lists, or allow for something a bit more standardized, for example a JSON string. The latter would allow for more complex data structures too of course. Although my concern is, if you're trying to put that many parameters in the file, maybe an associated file might be a better choice. Hopefully @brindakv can opine once she's back in action.

BTW, I'm always wary of using str(x) to output more complex Python data structures to files because

  1. You might end up with junk like <ihm.System object at 0x1045e4e50> which you can't read back in, of course.
  2. You need to be a bit careful reading the file back in, for example avoiding the temptation to say x = eval(file_contents) which could result in executing arbitrary Python code.
bienchen commented 2 years ago

Maybe JSON as type in _ma_software_parameter would be the better idea, here. Then an associated file makes sense.That means figuring out how to get this into the parameter list.

brindakv commented 2 years ago

@bienchen could you provide an example of ColabFold's model_order list or other software parameter lists? Although _ma_software_parameter.value can accept a list, I don't think it was meant to be used that way. Associated file is an option. However, it is not straightforward to link it to the software parameter table. What would _ma_software_parameter.value (mandatory) correspond to in this case?

bienchen commented 2 years ago

Here is a complete config.json from ColabFold: { "num_queries": 1, "use_templates": false, "use_amber": false, "msa_mode": "MMseqs2 (UniRef+Environmental)", "model_type": "AlphaFold2-multimer-v2", "num_models": 5, "num_recycles": 3, "model_order": [ 3, 4, 5, 1, 2 ], "keep_existing_results": true, "rank_by": "multimer", "pair_mode": "unpaired+paired", "host_url": "https://api.colabfold.com", "stop_at_score": 100, "recompile_padding": 1.1, "recompile_all_models": true, "commit": "b532e910b15434f707f0b7460abc25c70fcb9b26", "version": "1.2.0" }

Model order is a list of integers.

I think instead of an associated file, in this case having a data type json would be the simplest solution.

benmwebb commented 2 years ago

@brindakv and I discussed this today and agreed to extend the ma_software_parameter.data_type to include simple comma-separated lists of integers or floats, as others within PDB are reluctant to allow JSON in values.

brindakv commented 2 years ago

@bienchen @benmwebb This has been addressed in the latest ModelCIF update.