opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0
461 stars 216 forks source link

fix: avoid scrambling of group members #923

Open BenjaSanchez opened 4 years ago

BenjaSanchez commented 4 years ago

Problem description

When going through an I/O cycle with a model that has subsystems stored as groups, the members of each group in the sbml file get scrambled on each cycle, as those members are stored in set() structures, which are unordered. This significantly hinders cobrapy to be used for updating a model that has groups and is stored in a Github repository, as hundreds of lines in the sbml file will change order each time the model is saved, even if nothing in the model changes.

Ways in which this could be solved:

  1. Store members in lists instead of sets: more consistent with the rest of the model objects (mets/rxns/etc.), but could break backwards compatibility with any code out there that directly manipulates members.
  2. Sort the members of each group when saving the model as sbml: no BC loss, as the member structures would remain as sets. However, it would not respect any eventual predefined order by the user, as the sorting would override it.
  3. Store members in an OrderedSet: a nice middle-ground between the two previous options, but would require an additional dependency.
  4. Probably others?

Which way should be preferred? I'm happy to implement the fix, but first recommendations/thoughts from the community would be appreciated :)

Code Sample

For a model that has groups:

import cobra
model = cobra.io.read_sbml_model("model.xml")
write_sbml_model(model, "model.xml")

Actual Output

Example for one of the groups:

image

Expected Output

No changes to the model.

Dependency Information

System Information ================== OS Windows OS-release 10 Python 3.7.5 Package Versions ================ cobra 0.17.1 depinfo 1.5.1 future 0.18.2 numpy 1.17.4 optlang 1.4.4 pandas 0.25.3 pip 19.3.1 python-libsbml-experimental 5.18.0 ruamel.yaml 0.16.5 setuptools 42.0.1.post20191125 six 1.13.0 swiglpk 4.65.0 wheel 0.33.6
gregmedlock commented 4 years ago

I think sorting on SBML output is the best option if we wanted to solve this, but SBML is also not intended to be human readable. Most software for reading/writing SBML files operates under this convention; order of entries within any "listOfX" is not part of the SBML specification. Basically every container in cobrapy is unordered for efficiency, e.g., removing reactions and adding them back to the model in a different order will also lead to a git diff with many lines that are changed.

It is unfortunate that there isn't a reliable way to track changes with a text-based diff. I am of the opinion that these changes are best tracked using the version history of the files that make modifications to the model and encode the logic for each change, not the model itself (e.g., your scripts for constructing/curating the model and/or files with evidence for each piece of curation).

cdiener commented 4 years ago

What about the DictList that's used for reactions and metabolites? Might be beneficial to use the same type for all containers... I agree that diffing SBML will remain challenging but there were discussions that models should allow a round trip on saving/loading as this makes testing easier.

gregmedlock commented 4 years ago

Sounds good to me. While we're at it, reaction.genes is currently a set as well... I frequently find myself trying to access genes within a reaction, e.g. reaction.genes[0], which doesn't work since sets aren't subscriptable. Might be worth unifying reaction.genes to be a DictList too, rather than set/frozenset.

Midnighter commented 4 years ago

DictList is a very decent choice but it does break backwards compatibility since it has a different interface than set. I'm not sure how many projects modify group members directly.

BenjaSanchez commented 4 years ago

Thanks for your inputs!

I am of the opinion that these changes are best tracked using the version history of the files that make modifications to the model and encode the logic for each change, not the model itself.

what about tracking both the model and associated scripts? in yeast-GEM (which uses cobratoolbox) this works well in most PRs, with very occasional scrambling of the sbml file, for which additional model formats as .yml and .txt are provided (e.g. https://github.com/SysBioChalmers/yeast-GEM/pull/131).

diffing SBML will remain challenging but there were discussions that models should allow a round trip on saving/loading as this makes testing easier.

Another big problem is branching: if some fields get scrambled on every save, creating 2 branches for 2 different collaborators to work in the model will lead to hundreds of conflicts.

DictList is a very decent choice but it does break backwards compatibility since it has a different interface than set. I'm not sure how many projects modify group members directly.

DicList sounds like a good solution. I'm happy to work on this next week, unless this break of BC would affect many people in the community.

cariaso commented 4 years ago

Backwards compatibility could be improved by adding

s.add(x) -> s.append(x) s.remove(x) -> del s[x]

and possibly all of ['and', 'cmp', 'iand', 'ior', 'isub', 'ixor', 'or', 'rand', 'ror', 'rsub', 'rxor', 'sub', 'xor', 'add', 'clear', 'copy', 'difference', 'difference_update', 'discard', 'intersection', 'intersection_update', 'isdisjoint', 'issubset', 'issuperset', 'symmetric_difference', 'symmetric_difference_update', 'union', 'update']

to throw a warning + traceback + exception.

So that no one gets surprised by a silent change.

matthiaskoenig commented 4 years ago

You should consider to update to python 3.6 or above, which preserves dict insert order as well as set insert order. This should solve your issue. For now this is only an implementation detail (and could change in future python version), so don't rely on it. But it should solve your issue.

BenjaSanchez commented 4 years ago

I have opened PR #927 with the change to DictList implemented for group members. If the implementation looks good I can extend it to reaction.genes as well @gregmedlock

@matthiaskoenig I believe sets are still not ordered in python 3.6: https://stackoverflow.com/questions/45581901/are-sets-ordered-like-dicts-in-python3-6

gregmedlock commented 4 years ago

PR #927 looks good to me, but I'm not sure whether it's also worth breaking backwards compatibility in reaction.genes yet, since there is a desire to improve handling of GPRs, e.g., Issue #739 which would likely break backwards compatibility again/in other ways.

Perhaps we should discuss further in #739?