opencobra / memote

memote – the genome-scale metabolic model test suite
https://memote.readthedocs.io/
Apache License 2.0

'useless' genes in GPR rules #533

Open tpfau opened 5 years ago

tpfau commented 5 years ago

Problem description

This has nothing to do with memote directly, but since memote aims at improving model quality, I think it will reach the right people here.

When trying to make GPR association parsing more efficient, I stumbled over several GPR association rules (in different models) with structures like:

(A & B) | (A & B & C)

Quite obviously, C is irrelevant in this instance, as A & B is both sufficient and necessary for this formula to evaluate to true. I assume these kinds of GPR associations come from a complex that "can" be formed and that can catalyse the respective reaction with or without an additional protein C (probably with a different efficiency when C is part of the complex).

However, to use this information, it would be necessary to have a PR association instead of only a GPR association, with distinct proteins for A & B and A & B & C, each of which has a gene-protein relationship. Purely using the GPR rule will (as far as I can imagine) never allow a distinction between the two, at least on a Boolean level, and could actually lead to C being removed from the rule entirely if a logic parser or algorithm is used to reduce/normalize the formula.

Personally, I think that this kind of situation should be avoided and that we might actually need defined proteins (even if we don't have database IDs to link them to) that form a PR in models, with GP rules building the proteins, to allow making the distinction here. What is your take on this situation?
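To make the Boolean argument concrete, here is a minimal sketch using sympy (chosen only for illustration; it is not part of memote). The symbols A, B and C stand in for arbitrary gene identifiers. It reduces the rule and then compares the truth tables of the original and reduced forms.

```python
from itertools import product

from sympy import symbols
from sympy.logic.boolalg import simplify_logic

# Placeholder gene symbols; real models would use their own gene IDs.
A, B, C = symbols("A B C")

rule = (A & B) | (A & B & C)
reduced = simplify_logic(rule)
print(reduced)  # A & B


def same_truth_table(f, g, variables):
    """Return True if f and g agree on every True/False assignment."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if bool(f.subs(assignment)) != bool(g.subs(assignment)):
            return False
    return True


print(same_truth_table(rule, reduced, (A, B, C)))  # True
```

The reduced form no longer mentions C, although both forms evaluate identically for every knockout combination.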

Code Sample

NA

Context

NA

Midnighter commented 5 years ago

Thank you for your thoughts, Thomas. In general, this seems like an issue that I'd rather tackle on the cobrapy side. To summarize my thoughts: other than wanting to improve GPR handling, I'd stick to how things are done currently.

While I agree that C adds no further information to the Boolean rule, it does actually play a role when mapping transcriptomics data onto the model. So I think we would do our users a disservice by simplifying GPRs.

I agree that having GP and PR associations separately would be great but I'd rather see that change being pushed to SBML first.

You might also be interested in #209.

tpfau commented 5 years ago

I completely agree that removing it is problematic, and yes, I will suggest it for a v4 of the FBC package (I doubt it will go into v3).

And with respect to mapping transcriptomics data, at least when not just looking for associations: I can only imagine very odd mapping schemes in which it would have an effect (unless the GPR rule essentially gets processed into GP and PR rules). The common min(and)/max(or) or min(and)/sum(or) schemes would not be influenced by the gene. And I would consider any mathematical operation other than min for "and" rather odd. The rule says it's a "required" gene for that part, so there can be no more activity than what this part allows. What kind of mapping were you thinking about?
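The mapping schemes mentioned here can be sketched in a few lines of plain Python. This is an illustration only: the nested-tuple rule encoding, the `score` helper and the expression values are all hypothetical and not taken from any existing tool.

```python
def score(rule, expression, and_op=min, or_op=max):
    """Recursively score a GPR rule given per-gene expression values.

    A rule is either a gene ID (str) or a tuple ("and" | "or", child, ...).
    """
    if isinstance(rule, str):
        return expression[rule]
    operator, *children = rule
    values = [score(child, expression, and_op, or_op) for child in children]
    return and_op(values) if operator == "and" else or_op(values)


# (A and B) or (A and B and C), written exactly as it appears in the model.
RULE = ("or", ("and", "A", "B"), ("and", "A", "B", "C"))

low_c = {"A": 5.0, "B": 3.0, "C": 1.0}
high_c = {"A": 5.0, "B": 3.0, "C": 100.0}

# min(and)/max(or): the value of C never changes the result.
print(score(RULE, low_c), score(RULE, high_c))  # 3.0 3.0

# min(and)/sum(or) on the unreduced rule: both terms are added.
print(score(RULE, low_c, or_op=sum), score(RULE, high_c, or_op=sum))  # 4.0 6.0
```

Under min/max, the second term can never exceed min(A, B), so C has no influence. Under min/sum applied to the rule exactly as written, the two terms are added, so the score does change with C; independence from C in that scheme relies on the rule first being reduced to A and B.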

Midnighter commented 5 years ago

Saying "mapping" here was a bit misleading. I rather meant visualization of omics data on pathways. We sometimes use GPRs to show transcript fold changes on pathways, for example, and we show the fold change of each individual transcript. If C was measured, I would certainly want to see it and not have it silently ignored because it cannot be "mapped".

tpfau commented 5 years ago

Sure, and as I said, I agree that removing it is not good. I just see the problem that when "normalizing" a GPR into, e.g., DNF or some other form, this can happen quite easily, and at the same time these types of transformations are necessary to use meaningful activity mapping methods. That's why I wanted to have some more opinions on what to do to address these oddities in the future, and I personally think that moving from the GPR to a two-step system might be the cleaner approach.

Midnighter commented 5 years ago

> That's why I wanted to have some more opinions on what to do to address these oddities in the future, and I personally think that moving from the GPR into a two step system might be the cleaner approach.

You definitely have my support for this and I can raise my voice on the FBC mailing list if you feel like pushing the issue.

In the meantime, I suggest the following approach: convert GPR associations in cobrapy into proper sympy expressions but leave them otherwise unchanged. Once they are sympy expressions, any tool that requires normalization or other transformations can rely on the sympy toolbox to do so.
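A rough sketch of what this could look like follows. The `gpr_to_sympy` helper is hypothetical and is not cobrapy's API; it assumes rules are written with `and`/`or` and that gene IDs are valid Python identifiers, which does not hold for every model.

```python
import ast

from sympy import And, Or, Symbol
from sympy.logic.boolalg import to_dnf


def gpr_to_sympy(rule):
    """Hypothetical helper: turn an and/or rule string into a sympy expression.

    The structure of the rule is preserved; nothing is simplified here.
    """

    def build(node):
        if isinstance(node, ast.BoolOp):
            operator = And if isinstance(node.op, ast.And) else Or
            return operator(*(build(value) for value in node.values))
        if isinstance(node, ast.Name):
            return Symbol(node.id)
        raise ValueError(f"Unsupported element in rule: {ast.dump(node)}")

    return build(ast.parse(rule, mode="eval").body)


expr = gpr_to_sympy("(A and B) or (A and B and C)")
print(sorted(str(gene) for gene in expr.free_symbols))  # ['A', 'B', 'C']

# Normalization happens only when a tool explicitly asks for it.
normalized = to_dnf(expr, simplify=True)
print(normalized)  # A & B
```

The stored expression still lists C, so per-gene visualization keeps working, while a tool that needs a reduced normal form can derive it on demand without altering the model.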