opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0
461 stars 216 forks source link

Gene Label is set equal to ID #963

Open Fxe opened 4 years ago

Fxe commented 4 years ago

https://github.com/opencobra/cobrapy/blob/4129a13b0055477e85bec5c0c4e0cf72b4641c7c/cobra/io/sbml.py#L1055

Is this the expected outcome, that the gene fbc:label is equal to fbc:id ? Because the ID is mutated with prefix "G_" the label is also mutated, looking back at the models in BiGG (e.g., http://bigg.ucsd.edu/static/models/iML1515.xml) The ID is mutated with the prefix "format" (Gxxxx) while Label is the original ID without "G"

import and export the iML1515.xml:

From BiGG (iML1515.xml): <fbc:geneProduct fbc:id="G_b0067" fbc:label="b0067" fbc:name="thiP" metaid="G_b0067" sboTerm="SBO:0000243">

Exported from cobra <fbc:geneProduct metaid="meta_G_b0067" sboTerm="SBO:0000243" fbc:id="G_b0067" fbc:name="thiP" fbc:label="G_b0067">

matthiaskoenig commented 4 years ago

@Fxe just import and export your model without any prefix changes. I.e. set the f_replace=None in the read_sbml_model and write_sbml_model functions. My personal take on this is: Id replacements should be avoided at all costs and result in a ton of follow up bugs because different people expect different things to happen with id replacements. All this id replacement code is in there because some databases and tools are generating strange internal ids. Personally I would just cut all the replacement code which will simplify things.

Not sure what the right behavior is here, but most likely to have a label without prefix. I will have a look at this, but like said above just have models with correct SBML ids and read and write the models without id replacements.

Fxe commented 4 years ago

The problem is that, you have to replace the ID just to safe guard that ID is a valid sbml id: SId pattern: (_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])* (not sure if this is true for fbc but it should just for sake of consistency)

So, yes if I disable the replace function no prefix is added to either fbc:label or fbc:id, but this may cause problems if genes are numeric.

Just some context: the fbc:label is the field that I use to recover the gene identifiers of the model, this would allow me to avoid string manipulations, even if it was a simple: if then prefix removal.

Also taken from http://co.mbine.org/specifications/sbml.level-3.version-1.fbc.version-2.release-1.pdf

The label attribute The primary purpose of a GeneProduct is to uniquely reference a gene or implied gene product. As there is, currently, no restriction on the format of these these references they cannot be assumed to conform to an SBML SId syntax. Therefore the Flux Balance Constraints package defines the required attribute label, of type string, for this purpose.

My suggestion is to set gp.setLabel(cobra_gene.id) instead of gp.setLabel(gid)

matthiaskoenig commented 4 years ago

I agree. label and id are two totally different things. That is why it does not make any sense at all to try to create one from the other by arbitrary very subjective replacement rules. This is separate information which should be stored separately with no connection whatsoever. It is impossible to figure out from an arbitrary string which transformations are expected to get to the label. To clarify my point: Assume I use the G_* syntax to uniquely identify my genes via the label and you use a label without G_* prefix. So changing the behavior would break my hypothetical labels. There is no right or wrong here, but arbitrary replacements are worse then not replacing anything.

Basically there should be a check if a label attribute is set on the cobra gene which could be stored in the label field. This could be easily added and then one could store the label on the gene.label. This is easy to support for reading and writing and will solve your issue.

matthiaskoenig commented 4 years ago

Sorry forgot: Yes, the following makes sense

My suggestion is to set gp.setLabel(cobra_gene.id) instead of gp.setLabel(gid)

But does not solve the underlying issue that label is something different then id or name and should be stored separately if there should be no information loss.

Midnighter commented 4 years ago

I don't remember the details now but @BenjaSanchez recently worked on gene labels as he's trying to make the SBML output from the COBRA Toolbox and cobrapy be the same.

BenjaSanchez commented 4 years ago

@Midnighter my issue #950 has to do with gene names and the fact that cobratoolbox stores them as label (but not as name), making it incompatible with cobrapy when the model is saved with one tool and loaded with the other. But IMO this is bad behavior on the cobratoolbox side, as using gene ids makes more sense.

matthiaskoenig commented 4 years ago

@all I will tackle this issue within the week. The clean way is to store and retrieve id, name and label for genes without messing/mixing these up. This will preserve the information. In case there is no label set on a gene we will fall back on the gene.id without any replacements, i.e. no prefix additions as suggested by @Fxe (and in my opinion the correct behavior if no label is set).