Open Fxe opened 4 years ago
@Fxe just import and export your model without any prefix changes. I.e. set the f_replace=None
in the read_sbml_model
and write_sbml_model
functions.
My personal take on this is: Id replacements should be avoided at all costs and result in a ton of follow up bugs because different people expect different things to happen with id replacements. All this id replacement code is in there because some databases and tools are generating strange internal ids. Personally I would just cut all the replacement code which will simplify things.
Not sure what the right behavior is here, but most likely to have a label without prefix. I will have a look at this, but like said above just have models with correct SBML ids and read and write the models without id replacements.
The problem is that, you have to replace the ID just to safe guard that ID is a valid sbml id:
SId pattern: (_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*
(not sure if this is true for fbc but it should just for sake of consistency)
So, yes if I disable the replace function no prefix is added to either fbc:label or fbc:id, but this may cause problems if genes are numeric.
Just some context: the fbc:label is the field that I use to recover the gene identifiers of the model, this would allow me to avoid string manipulations, even if it was a simple: if then prefix removal.
Also taken from http://co.mbine.org/specifications/sbml.level-3.version-1.fbc.version-2.release-1.pdf
The label attribute The primary purpose of a GeneProduct is to uniquely reference a gene or implied gene product. As there is, currently, no restriction on the format of these these references they cannot be assumed to conform to an SBML SId syntax. Therefore the Flux Balance Constraints package defines the required attribute label, of type string, for this purpose.
My suggestion is to set gp.setLabel(cobra_gene.id)
instead of gp.setLabel(gid)
I agree. label
and id
are two totally different things. That is why it does not make any sense at all to try to create one from the other by arbitrary very subjective replacement rules. This is separate information which should be stored separately with no connection whatsoever. It is impossible to figure out from an arbitrary string which transformations are expected to get to the label.
To clarify my point: Assume I use the G_*
syntax to uniquely identify my genes via the label and you use a label without G_*
prefix. So changing the behavior would break my hypothetical labels. There is no right or wrong here, but arbitrary replacements are worse then not replacing anything.
Basically there should be a check if a label attribute is set on the cobra gene which could be stored in the label field. This could be easily added and then one could store the label on the gene.label
. This is easy to support for reading and writing and will solve your issue.
Sorry forgot: Yes, the following makes sense
My suggestion is to set gp.setLabel(cobra_gene.id) instead of gp.setLabel(gid)
But does not solve the underlying issue that label
is something different then id
or name
and should be stored separately if there should be no information loss.
I don't remember the details now but @BenjaSanchez recently worked on gene labels as he's trying to make the SBML output from the COBRA Toolbox and cobrapy be the same.
@Midnighter my issue #950 has to do with gene names and the fact that cobratoolbox stores them as label
(but not as name
), making it incompatible with cobrapy when the model is saved with one tool and loaded with the other. But IMO this is bad behavior on the cobratoolbox side, as using gene ids makes more sense.
@all I will tackle this issue within the week.
The clean way is to store and retrieve id
, name
and label
for genes without messing/mixing these up. This will preserve the information. In case there is no label
set on a gene we will fall back on the gene.id without any replacements, i.e. no prefix additions as suggested by @Fxe (and in my opinion the correct behavior if no label is set).
https://github.com/opencobra/cobrapy/blob/4129a13b0055477e85bec5c0c4e0cf72b4641c7c/cobra/io/sbml.py#L1055
Is this the expected outcome, that the gene fbc:label is equal to fbc:id ? Because the ID is mutated with prefix "G_" the label is also mutated, looking back at the models in BiGG (e.g., http://bigg.ucsd.edu/static/models/iML1515.xml) The ID is mutated with the prefix "format" (Gxxxx) while Label is the original ID without "G"
import and export the iML1515.xml:
From BiGG (iML1515.xml):
<fbc:geneProduct fbc:id="G_b0067" fbc:label="b0067" fbc:name="thiP" metaid="G_b0067" sboTerm="SBO:0000243">
Exported from cobra
<fbc:geneProduct metaid="meta_G_b0067" sboTerm="SBO:0000243" fbc:id="G_b0067" fbc:name="thiP" fbc:label="G_b0067">