opencobra / schema

xml/rdf schemas for annotating cobra models
Apache License 2.0

SBML ID to Model ID #6

Open tpfau opened 6 years ago

tpfau commented 6 years ago

Hi, I'm not entirely sure whether this is the right place to discuss this. If not, it would be great if someone could point me to the appropriate place.

I'm currently thinking about SBML IO and SBML identifiers. At the moment, the COBRA Toolbox uses the following convention: we convert an internal ID to an SBML ID by replacing every character that is not in a-zA-Z0-9_ with __Num__, where Num is the string representation of the original character cast to an integer. We further check whether any IDs start with a number. If such IDs exist, we prepend a marker that depends on the type of ID (M for metabolites, R for reactions, C for compartments and G for genes) to ALL IDs of that type. That is, when reading a file in which all IDs of a type start with this marker, we assume the file was generated by us.

Metabolites in the MATLAB structure commonly contain the compartment as part of the name (e.g. atp[c]), and this is actually the only place where the compartment is stored. So when exporting, we extract this information and put it into compartments. The identifier itself is kept completely, i.e. converted to "atp__91__c__93__", as we have to ensure that the same compound in different compartments does not get the same ID.
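
The export convention described above can be sketched in a few lines of Python. This is only an illustration, not the Toolbox's actual MATLAB code: the function names `encode_chars` and `export_ids` are hypothetical, and the exact marker string (for example "M_" rather than a bare "M") is an assumption.

```python
import re

# Anything outside a-z, A-Z, 0-9 and _ has to be replaced on export.
_INVALID_CHAR = re.compile(r"[^a-zA-Z0-9_]")


def encode_chars(internal_id):
    """Replace every character outside a-zA-Z0-9_ by __<code point>__."""
    return _INVALID_CHAR.sub(lambda m: "__%d__" % ord(m.group(0)), internal_id)


def export_ids(internal_ids, marker):
    """Encode all ids of one entity type.

    The marker (assumed format, e.g. "M_") is prepended to ALL ids
    as soon as at least one encoded id starts with a digit.
    """
    encoded = [encode_chars(i) for i in internal_ids]
    needs_marker = any(e[:1].isdigit() for e in encoded)
    return [marker + e if needs_marker else e for e in encoded]
```

With these assumptions, `encode_chars("atp[c]")` yields `atp__91__c__93__`, since `[` and `]` have the character codes 91 and 93. A list that contains an id such as `7thf[c]` gets the marker on every entry, not only on the offending one.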

One more peculiarity: we currently have the following concept for GeneProducts. The id is defined as above (a conversion of what is in model.genes). The label is the gene ID itself, as stored in our model.genes field (this is commonly some database identifier). The name stores protein names for a specific gene product.

What I'm wondering: how is cobraPy doing this? And are there suggestions/ideas how this "should" be done?

matthiaskoenig commented 6 years ago

As far as I remember, COBRApy is also changing ids, but I am no expert on this.

Just wanted to comment on the general practice of replacing identifiers in COBRA and COBRApy. This is very bad practice and should not be done. Identifiers are there to identify things and to map things onto them. Replacing identifiers by arbitrary rules should never be done, and it would be great if this could be removed from COBRA and COBRApy. It does not matter for anything whether an id starts with R or M, because it is clear whether it is a metabolite or a reaction. Users choose ids for a reason; changing them on import is very bad practice. Basically, I have to figure out what was replaced and undo the replacements to be able to map flux distributions onto SBML networks, because internally some arbitrary convention is used in cobra. Please don't do this if possible.

tpfau commented 6 years ago

@matthiaskoenig We are replacing IDs for SBML export where the ID does not conform to the specification of an SBML ID (i.e. does not match the regexp "[a-zA-Z_][a-zA-Z0-9_]*"). The alternative would be to use "met1 ... metN" as IDs, store the actual ID as the name, and annotate the name in some other way. However, that would break many old models during IO: the new software would not read them in correctly, since it would assume that IDs are now stored in the label and names somewhere else, thus leading to odd models.
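
As a quick way to see which IDs are affected, the validity check can be written as a small Python function. The pattern follows the SId syntax of the SBML specification as I understand it (a letter or underscore, then letters, digits or underscores); the name `is_valid_sid` is made up for this sketch.

```python
import re

# SId syntax: a letter or underscore, followed by any number of
# letters, digits or underscores (ASCII only).
SID_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_sid(candidate):
    """Return True if `candidate` could be used unchanged as an SBML id."""
    return SID_PATTERN.fullmatch(candidate) is not None


for example in ["atp_c", "7thf", "atp[c]", "pg-abc"]:
    print(example, is_valid_sid(example))
```

Only `atp_c` passes; the other three examples fail because of a leading digit, brackets, and a hyphen respectively.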

matthiaskoenig commented 6 years ago

@tpfau I understand backwards compatibility. But as a user I want to load an SBML model, run a simulation and get flux dictionaries that map onto the original identifiers. Just because some old models did things differently does not mean that a simple task like this should be impossible today. It cannot be that I have to read the source code and figure out which replacements were applied to the identifiers in order to map my fluxes onto my original models. But this is exactly what I have to do with COBRA(py).

matthiaskoenig commented 6 years ago

Here are the things happening in cobra.io.sbml3.py, to get back to your question.

Just arbitrarily clipping prefixes because why not. So basically COBRA is arbitrarily prepending stuff, and cobrapy is arbitrarily clipping it.

...
# add metabolites
for species in xml_model.findall(SPECIES_XPATH % 'false'):
    met = get_attrib(species, "id", require=True)
    met = Metabolite(clip(met, "M_"))
    ...

# add genes
for sbml_gene in xml_model.iterfind(GENES_XPATH):
    gene_id = get_attrib(sbml_gene, "fbc:id").replace(SBML_DOT, ".")
    gene = Gene(clip(gene_id, "G_"))
    ...

# add reactions
for sbml_reaction in xml_model.iterfind(
        ns("sbml:listOfReactions/sbml:reaction")):
    reaction = get_attrib(sbml_reaction, "id", require=True)
    reaction = Reaction(clip(reaction, "R_"))
    ...

# string utility functions
def clip(string, prefix):
    """clips a prefix from the beginning of a string if it exists

    >>> clip("R_pgi", "R_")
    "pgi"

    """
    return string[len(prefix):] if string.startswith(prefix) else string
matthiaskoenig commented 6 years ago

Ahhh. And arbitrarily prepending again at the export. Are you kidding me.

# add in metabolites
species_list = SubElement(xml_model, "listOfSpecies")
for met in cobra_model.metabolites:
    species = SubElement(species_list, "species",
                         id="M_" + met.id,
                         # Useless required SBML parameters
                         constant="false",
                         ...
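
To make the combined effect of the import and export excerpts concrete, here is a minimal self-contained sketch. `clip` is copied from the excerpt above; `round_trip` is a hypothetical helper that only mimics "clip on import, prepend on export" and is not part of cobrapy.

```python
def clip(string, prefix):
    """Remove `prefix` from the start of `string` if it is present."""
    return string[len(prefix):] if string.startswith(prefix) else string


def round_trip(sbml_id, prefix="M_"):
    """Mimic reading an id (prefix clipped) and writing it back (prefix added)."""
    internal_id = clip(sbml_id, prefix)
    return prefix + internal_id


for original in ["M_atp_c", "atp_c"]:
    print(original, "->", round_trip(original))
```

An id that already carried the prefix survives the round trip, but an id without it comes back changed, which is exactly the mapping problem described in this comment.
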
tpfau commented 6 years ago

@matthiaskoenig: Again, this is not arbitrary. SBML IDs have to match a specific regexp, and a lot of ids don't adhere to that, so to make them valid SBML IDs you have to adjust them. In the Toolbox we aim to make these changes only if they are necessary to adhere to the SBML specification (e.g. if a metabolite ID starts with a number, like 7thf, then that ID is invalid for SBML).

I agree, that these changes are "not nice", and I have had enough trouble with them myself, but I don't see how they can be avoided without annoying so many users.

But in the end, it is impossible to export non conforming IDs, at least not as ID, and if you want to match, you can still use the name attribute.

rmtfleming commented 6 years ago

Hi all,

if there is not already, will there ever be support for characters like [ ] in the identifiers?

Are the characters in InChI supported? https://iupac.org/who-we-are/divisions/division-details/inchi/

Why is there a need for constraints on the characters that make up an SBML ID in the first place?

Regards,

Ronan

matthiaskoenig commented 6 years ago

You are completely free in the characters on the metaid in SBML, so if one wants to use very complex patterns one can use these on the metaId, like inchi and so on. Ids should be a short string which works well with most computer languages, i.e. not contain signs like '-' or start with numbers. In my experience this is a general rule which is respected for ids by most databases and models with very few exceptions.

tpfau commented 6 years ago

> In my experience this is a general rule which is respected for ids by most databases and models with very few exceptions.

@matthiaskoenig as for the IDs in SBML, they are VERY restrictive, and there are few databases whose IDs all match the pattern. KEGG: yes, because its IDs are not human readable at all. MetaCyc: no, it uses "-" in its IDs a lot. BiGG: no, lots of metabolites start with numbers. MetaNetX: yes, again the IDs are not human readable.

Essentially, any database I have seen whose IDs are not just consecutive numbers of some form has IDs that are forbidden as IDs in SBML and would need to be converted. Yes, we could use the metaid to provide the ID. But that would still leave the ID "different".

matthiaskoenig commented 6 years ago

Yes, there are definitely issues.

But while BiGG database ids are not SBML conformant, BiGG clearly states how the SBML model ids should look, so the transformation from database id to model id is clear: https://github.com/SBRG/bigg_models/wiki/BiGG-Models-ID-Specification-and-Guidelines

MetaCyc should provide a respective document if they expect their ids being used as sbml ids.

tpfau commented 6 years ago

Well, these would be "clear" if the IDs were unambiguous. But there are things like 26dap_LL_c, which according to their spec could mean different things (tissue c in compartment LL, or just compartment c, or no compartment info at all).

However, in essence our output is also clear: the ids are IDs made compatible with SBML by the following conversions: addition of R, M, G, C for reactions, metabolites, genes and compartments respectively; replacement of non-SBML-conformant characters by __ASCII(Char)__. The same is applied in reverse when reading. So yes (as with BiGG), there could be instances where we become ambiguous, namely if we have __[0-9]+__ in our original ID. But apart from these instances (which I personally have never encountered), we are not.
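
The ambiguous case mentioned above is easy to reproduce. The following Python sketch covers only the character replacement (not the type prefix); `encode` and `decode` are hypothetical names and do not reflect the Toolbox's real code.

```python
import re

_INVALID_CHAR = re.compile(r"[^a-zA-Z0-9_]")
_ENCODED_CHAR = re.compile(r"__([0-9]+)__")


def encode(internal_id):
    """Replace each invalid character by __<code point>__."""
    return _INVALID_CHAR.sub(lambda m: "__%d__" % ord(m.group(0)), internal_id)


def _restore(match):
    code = int(match.group(1))
    # Leave the text untouched if the number is not a valid code point.
    return chr(code) if code <= 0x10FFFF else match.group(0)


def decode(sbml_id):
    """Turn every __<number>__ back into the character with that code."""
    return _ENCODED_CHAR.sub(_restore, sbml_id)


print(decode(encode("atp[c]")))    # round trip works
print(decode(encode("a__91__b")))  # original already contains __91__
```

The first id is recovered as `atp[c]`. The second id contains only valid characters, so `encode` leaves it alone, yet `decode` turns it into `a[b`. This is the one situation in which the scheme is not reversible.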

tpfau commented 6 years ago

Actually this is part of the reason, why I did bring this up, because we do need some specified mechanism, and it would (imo) be good if all COBRA projects can use the same mechanism.

Midnighter commented 6 years ago

> Actually this is part of the reason, why I did bring this up, because we do need some specified mechanism, and it would (imo) be good if all COBRA projects can use the same mechanism.

Strongly agree with that point. I'm pretty unopinionated when it comes to the actual implementation because I think BiGG identifiers are a mess anyway so I'm happy to push for implementation in cobrapy whatever comes out of this discussion.

tpfau commented 6 years ago

One of my main issues is backward compatibility. The "easiest" and probably cleanest solution would be to only ever use the metaid to store the actual id (as this is pretty flexible). But even the metaid has restrictions, which would need to be covered. One problem I see with doing so is that most tools rely on the ID being something sensible, and often do not even look at the metaid. And it would be only "us" using the IDs that way, so anything that comes from external sources would not necessarily be compatible.

matthiaskoenig commented 6 years ago

That sounds like a great idea, i.e. create a general guideline for how to treat internal identifiers when exporting them to SBML. Basically, everything which is already a valid SBML ID should be unchanged; for all the other cases a general guideline should exist. It would be great if this could be applied in the COBRA projects, and even better in the general constraint-based modeling community.

tpfau commented 6 years ago

> Basically everything which is already a valid sbml ID should be unchanged,

This might be problematic. I actually prefer additional modifications if they are consistent throughout a model. While a general replacement scheme (as I suggested above, and as currently used in the COBRA Toolbox) is generally applicable, this is not true for special situations such as the starting characters of an ID. If something does not fit the SBML ID rules because of its starting character, I would prefer that the whole model gets a prefix rather than just a single entity. This way, I can be much more certain that, if this prefix is present for all entities, it is an addition made because of a problematic individual entity. AND this is what used to be done in old models (i.e. it is backward compatible). If we only modify those entities where a problem occurs, I would feel much less certain that this is not a chance metabolite that actually has this M_ in its real ID.
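
The "all or nothing" reading of prefixes argued for here could look roughly like this on import. It is a sketch under the assumption that the prefix is a plain string such as "M_"; `strip_common_prefix` is a hypothetical name.

```python
def strip_common_prefix(sbml_ids, prefix):
    """Strip `prefix` only if every id of this entity type carries it.

    If even one id lacks the prefix, it is assumed to be part of the
    real ids and nothing is changed.
    """
    ids = list(sbml_ids)
    if ids and all(i.startswith(prefix) for i in ids):
        return [i[len(prefix):] for i in ids]
    return ids


print(strip_common_prefix(["M_atp_c", "M_7thf_c"], "M_"))
print(strip_common_prefix(["M_atp_c", "glc_c"], "M_"))
```

In the first call every id has the prefix, so it is treated as an export artefact and removed. In the second call the lone `M_atp_c` is left alone, because the prefix might belong to the actual identifier.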

draeger commented 6 years ago

The link https://github.com/SBRG/bigg_models/wiki/BiGG-Models-ID-Specification-and-Guidelines provided by @matthiaskoenig is the right document to see how BiGG Models Database and COBRApy do it. In fact, ids are here kind of overloaded with semantic information. This can be done, but as discussed it adds more cumbersome mapping problems to the whole import/export.

According to the SBML specification, the value of an id attribute of an element should actually never be exposed to the user. There is also a name attribute, where you can really have any text. @matthiaskoenig suggested using the metaid. While this works with almost any String, this is also an attribute that is not intended to be exposed to users. Please note that there are restrictions to metaid. In fact, the pattern for the metaid is much more complex than that for the id (see, for instance, SBML specification L2V2R1 p. 12, Section 3.1.6 and the definition of the corresponding symbols at http://www.w3.org/TR/2000/REC-xml-20001006#NT-CombiningChar). So, please do not simply use metaid without checks! Use the name attribute for anything to be displayed to users about the element.
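
A conservative check along these lines can be sketched in Python. Note the assumption: the pattern below covers only the ASCII part of the metaid syntax as I read it (first a letter, '_' or ':', then letters, digits, '.', '-', '_' or ':'), so it rejects some values that the full XML definition would accept. `is_safe_metaid` is a hypothetical name.

```python
import re

# ASCII-only approximation of the XML ID type used for metaid.
# The full definition also admits many non-ASCII letters, combining
# characters and extenders, which this sketch deliberately rejects.
METAID_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")


def is_safe_metaid(candidate):
    """Return True if `candidate` fits the ASCII subset of the metaid syntax."""
    return METAID_ASCII.fullmatch(candidate) is not None


for example in ["pg-abc", "atp[c]", "7thf"]:
    print(example, is_safe_metaid(example))
```

Under this approximation `pg-abc` is acceptable as a metaid, while `atp[c]` and `7thf` are not, so typical COBRA ids with brackets or leading digits could not be stored in the metaid unchanged either.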

Now, let me answer @rmtfleming's question of why identifiers in SBML have such restrictions and whether they will ever be changed. They probably won't be changed. Originally, the restrictions came from the various programming languages, where variable names usually also cannot start with numbers or contain blanks etc. So you are able to use the value of an id attribute directly when generating executable source code from an SBML model. Their purpose is to give a unique identifier to an object within a specific namespace inside an SBML document. Note that there are possibly multiple namespaces. For instance, units can have the same id as a species, because they are in different namespaces. So, ids are only used internally and not exposed to users. They are variable names, nothing else. You have annotations, you have name etc., so there is no need to also overload ids.

tpfau commented 6 years ago

Hmmm... The thing with "name" is that there is commonly already a name attached to entities in COBRA models (which is more descriptive, and which no one ever wants to type ^^). And here again we get into backward compatibility issues: most models use the ID for the shorthand identifiers (modified to adhere to the SBML spec) and use the name for the longer descriptive information. So switching this would make using old models very difficult. Also, in almost any publication, when identifiers are used for reactions/metabolites, they refer to the ID of the SBML element. While I agree that this is not how it is supposed to be, changing it is almost impossible, as I would claim that most (if not all) tools around expect it.

draeger commented 6 years ago

@tpfau then maybe the best approach would really be to check the document provided by the BiGG/COBRApy team and see whether it works for the COBRA Toolbox as well. However, as @matthiaskoenig pointed out, it would be good if identifiers that do not match that scheme could still be accepted upon import by the COBRA Toolbox, as long as they are valid IDs (in SBML).

draeger commented 6 years ago

And, maybe the metaid could be a place to store COBRA ids indeed. It needs to be checked if the pattern for this attribute allows all characters that you require. The id could then just be whatever, e.g., M1... M9434 or G543 etc.

tpfau commented 6 years ago

@draeger The document (at least the one @matthiaskoenig pointed to) is somewhat odd. As mentioned above, its formulation leads to ambiguities: it is unclear whether an item is doubly localised (tissue/compartment), singly localised, or not localised at all. So I don't really want to adopt that specification, as it is hard to extract the proper information from it. Let's start with what we currently have in (I would actually claim) a lot of models, and with a "common" way to write things in the community:

Metabolites tend to look roughly like:
metID[compID], where compID is regularly a single character.

Reactions don't have any clear specification, but there are some things I have seen in multiple models:
EX_metName indicates a metabolite exchanger. (However, this is very ambiguous: sometimes metName is the name including the compartment, sometimes not, and sometimes the brackets [] of the compartment part are replaced by parentheses ().)
DM_metName (as with EX, but a demand).
sink_metName or SK_metName (same as DM_metName).
An ID followed by a lower case t, or a TAB_metName, often indicates a transporter (the latter from compartment A to compartment B).

Compartments:
Commonly a 1 letter abbreviation, but sometimes more than one letter.

Tissues:
I haven't seen many yet, but they tend to come as Tissue_entityID, or as part of the compartment ID. The variant used in the BiGG document was actually new to me.

Now, for metabolites there is the issue that a model will have multiple metabolites with the same name but different compartments. So using metID alone as the SBML ID is not possible. That leaves us with a compartment ID that we have to translate somehow. And here comes the issue: the BiGG/COBRApy document suggests doing this by concatenating it to the global id with _. But as mentioned, this leads to ambiguities. We have the [id] compartment specifier (which we can't change, imo), and we currently simply translate this into a valid SBML ID by the __ASCII(char)__ translation. So we keep the full ID of the metabolite (replacing only invalid characters and adding a prefix). Yes, this becomes less human readable in the SBML. But a straightforward regular expression can be used to convert it back.

Also, having the tissue information encoded similarly to a compartment will (imo) lead to problems in the end: this way, the SBML will indicate that the metabolites are in the very same compartment, and it doesn't know about the tissue. So my personal opinion here is to merge compartment and tissue IDs, i.e. have the compartment/tissue be one unit. You can have a scheme in the compartment ID or name that indicates the tissue association, but I wouldn't just put it into the metabolite. And if it is stored in the compartment, the compartment IDs no longer match the compartment IDs written in the metabolites, so this becomes a mess.

Personally, what I would do:

global ID -> In SBML this is essentially the SpeciesType of all Species that are localised in any compartment, for metabolites. For anything else, this is the general ID of the entity. This ID, when exported from COBRA (or BiGG), starts with a prefix for the type of entity (to allow for IDs starting with numbers), and any character that does not conform to the SBML ID rules is replaced by its ASCII number in double underscores. (This way, we don't have to restrict the ids in our models at all, and any language can translate the string for output and use the ID for computation.)

local ID -> This would be the actual SBML Species/Reaction/GeneProduct, with a suffix indicating localisation. For species, this localisation would contain both compartment and tissue (my suggestion is still to use [], as this is the most commonly used method); for anything else it is the tissue of the entity (important for genes, as they can/will have different expression in different tissues).

Compartments are special, as they have to include the tissue information (there are no tissues in SBML, and I don't expect them to become part of it any time soon). That is, a compartment ID could look like globalID_TissueID, with both IDs restricted to letters. Thus, a reaction local ID would be TK1[Tissue], and a metabolite ID would be atp[Comp_Tissue], each time with the tissue part being optional. This is translated to SBML by converting the brackets to ASCII as above.
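
One possible reading of the local ID layout above, written as a small Python parser. This is an interpretation for illustration only: `parse_local_id`, the dictionary keys and the `is_metabolite` flag are all hypothetical, and the sketch assumes the bracketed part holds letter-only ids joined by a single underscore.

```python
import re

# <globalID>[<location>]; the bracketed part is optional and holds one
# or two letter-only ids joined by an underscore.
_LOCAL_ID = re.compile(r"([^\[\]]+)(?:\[([A-Za-z]+)(?:_([A-Za-z]+))?\])?")


def parse_local_id(local_id, is_metabolite):
    """Split a local id into global id, compartment and tissue."""
    match = _LOCAL_ID.fullmatch(local_id)
    if match is None:
        raise ValueError("not a recognised local id: %r" % local_id)
    global_id, first, second = match.groups()
    if is_metabolite:
        # Metabolites: [Comp] or [Comp_Tissue]
        return {"global_id": global_id, "compartment": first, "tissue": second}
    if second is not None:
        raise ValueError("only [Tissue] is expected here: %r" % local_id)
    # Reactions, genes and other entities: optional [Tissue]
    return {"global_id": global_id, "compartment": None, "tissue": first}


print(parse_local_id("atp[c_liver]", is_metabolite=True))
print(parse_local_id("TK1[liver]", is_metabolite=False))
```

The entity type has to be passed in because the bracketed part alone does not say whether a single entry such as `[c]` is a compartment or a tissue.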

matthiaskoenig commented 6 years ago

This all reads very complicated. It is not necessary to encode the tissue/species/metabolite/gene/reaction in the id at all; this information is easily obtainable from the SBML.

This is only about making SBML identifiers and recovering the originals afterwards. It is not about biology, so there is no need for any biological concepts here.

How about something simple:

[A] valid SBML id: write the id
[B] invalid SBML id: apply the following rules
  1. starts with an invalid character, e.g. a number: prepend 'ID'
  2. replace invalid characters: "-" --> "_", "non-ASCII" --> ASCII(char)
[C] store the original id as metaid; then it can be easily recovered without any complicated rules
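
Rules [A] to [C] can be sketched as a single Python function. Two points are assumptions rather than part of the proposal: the remaining invalid characters are written as `__<code>__`, and rule 1 is decided by looking at the first character of the original id. The name `to_sbml_id` is hypothetical.

```python
import re

_SID = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
_INVALID_CHAR = re.compile(r"[^a-zA-Z0-9_]")


def to_sbml_id(original_id):
    """Return (sbml_id, metaid) for one original identifier."""
    if _SID.fullmatch(original_id):
        # [A] already a valid SBML id: write it unchanged
        return original_id, original_id
    # [B] rule 2: "-" becomes "_", other invalid characters are encoded
    sbml_id = original_id.replace("-", "_")
    sbml_id = _INVALID_CHAR.sub(lambda m: "__%d__" % ord(m.group(0)), sbml_id)
    # [B] rule 1: original starts with an invalid character
    if not re.match(r"[a-zA-Z_]", original_id):
        sbml_id = "ID" + sbml_id
    # [C] keep the original id so that it can be stored as metaid
    return sbml_id, original_id


for example in ["atp_c", "7thf", "pg-abc", "atp[c]"]:
    print(example, "->", to_sbml_id(example))
```

The sketch does not check whether the original id is itself allowed as a metaid; as noted earlier in the thread, the metaid has its own syntax restrictions.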

tpfau commented 6 years ago

One question: Why handle - differently from other characters that are not valid in SBML IDs, i.e. why have an exception there?

And one comment: The COBRA Toolbox relies on metabolite ids containing the compartment identifiers. Thus, it will add a [compartmentID] to any metabolite which does not have it. Therefore, on export this will be part of the ID. If we skip it, atp[c] and atp[m] would have the same ID, i.e. the model would be invalid. So input and output from readSBML/writeSBML will only be the same if the file comes from the toolbox (or keeps the scheme outlined above). And e.g. BiGG models would be changed. I don't mind that, but it's just a remark.

tpfau commented 6 years ago

Also: When reading in an ID, how do you distinguish between a prefix ID and an ID that is part of the actual identifier? (That's why I would keep the specific prefixes. They also guard against someone having the same ID for a species and a gene product, which could happen if e.g. proteins are explicitly synthesized and present both as gene products and species.)

matthiaskoenig commented 6 years ago

The "-" exception would just keep most ids readable, because the '-' occurs quite often as a simple separator, whereas other characters besides [A-Z] do not occur very often. pg-abc --> pg_abc or pgabc is much more readable than pgASCII(123)__abc. But I am also okay with treating it identically to the other characters that are not supported in SBML ids.

When reading the ids, you just recover them from the metaids, i.e. something like

id = sbase.getId()
metaid = sbase.getMetaId()
if metaid != id:
    id = metaid

So it is very easy to get back to the original id, and you know if things were prefixed with ID.

Ids are globally unique within an SBMLDocument so no issues with identical GeneProduct ids or species ids. This is just not valid SBML.

In my opinion, it is a bug that the COBRA Toolbox changes the ids on import. The implementation should not rely on a specific naming scheme of the ids to make a model work. When reading a model with the COBRA Toolbox and writing it afterwards, the ids should be maintained. Otherwise, how would I map resulting flux distributions onto my SBML network? You just have to keep a dictionary of the original SBML identifiers and write them again when exporting the model.
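As a rough illustration of the dictionary idea, the following Python sketch keeps a two-way map between the ids found in the SBML file and whatever ids a tool uses internally. The class and function names are hypothetical and do not correspond to any existing COBRA or COBRApy API.

```python
from dataclasses import dataclass, field


@dataclass
class IdMap:
    """Two-way map between original SBML ids and tool-internal ids."""
    to_internal: dict = field(default_factory=dict)
    to_sbml: dict = field(default_factory=dict)

    def register(self, sbml_id, internal_id):
        # Refuse to map one internal id onto two different SBML ids.
        known = self.to_sbml.get(internal_id)
        if known is not None and known != sbml_id:
            raise ValueError(f"{internal_id!r} already maps to {known!r}")
        self.to_internal[sbml_id] = internal_id
        self.to_sbml[internal_id] = sbml_id


def map_fluxes_to_sbml(fluxes, id_map):
    """Re-key a flux dictionary from internal reaction ids to the original SBML ids."""
    return {id_map.to_sbml[rxn]: value for rxn, value in fluxes.items()}


id_map = IdMap()
id_map.register("R_EX_ac_e", "EX_ac(e)")
print(map_fluxes_to_sbml({"EX_ac(e)": -1.5}, id_map))
```

With such a map, the internal naming scheme can be anything the tool prefers, while results can still be reported against the ids the model author chose.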


draeger commented 6 years ago

@tpfau, the information about which compartment a species is in can be found in the compartment attribute of this species. If you rely on the id attribute here and parse the information out of it, you could introduce some inconsistency or maybe even ignore the compartment attribute?? Why not just obtain the information from there?

The class https://github.com/SBRG/ModelPolisher/blob/master/src/edu/ucsd/sbrg/bigg/BiGGId.java parses the different id components of a BiGG id. So far, I didn't come across any ambiguities. Just see how we decompose the id there to find the information needed. As said, it can be nice to have an id that can be somehow understood by looking at it.

However, it should not be the reliable source of information. Just as @matthiaskoenig pointed out, this introduces potential conflicts and contradictions with other attributes where the same information is stored in a well organized way.

rmtfleming commented 6 years ago

Hi All,

It is strange to hear that there is no need for biological concepts in discussions about SBML.

Here is an important biological concept: The same metabolite often appears in multiple different compartments. Therefore, even if we have a unique ID for each metabolite, and a unique ID for each compartment, we still need a way to distinguish between the same metabolite in two different compartments. metA[x] and metA[y] does this in the COBRA toolbox. I think it is a valid debate whether the compartment identifier should be identifiable with [ ] or something else, but some mechanism must exist to support the distinction. With the SBML specification as it is (which should not be immutable) what is the proposition to handle this situation?

Regards,

Ronan


--

Mr. Ronan MT Fleming B.V.M.S. Dip. Math. Ph.D.

Senior research associate (EN) == Chercheur (FR), Principal investigator, Systems Biochemistry Group, wwwen.uni.lu/lcsb/research/systems_biochemistry Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, 6, avenue du Swing, L-4367 Belvaux. & National Centre of Excellence in Research on Parkinson’s disease www.parkinson.lu & Adjunct Assistant Professor, Division of Analytical Biosciences, Leiden Academic Centre for Drug Research, Faculty of Science, University of Leiden. http://analyticalbiosciences.leidenuniv.nl

Mobile: +352 621 175 112 Office: +352 466 644 5528 Skype: ronan.fleming

(This message is confidential and may contain privileged information. It is intended for the named recipient only. If you receive it in error please notify me and permanently delete the original message and any copies.)

draeger commented 6 years ago

@rmtfleming, of course! And SBML does have an attribute compartment on species, where you can say

<species id="M_ATP" compartment="c" ... />

So, you don't need to have the compartment in the id as well. Imagine someone specified compartment "c" as the value of the compartment attribute and compartment "n" in the id. What do you do? Only what is in the compartment attribute is valid, because the id is just an identifier for an object and has no semantic meaning.

Everything should be separately encoded in specific attributes or annotations.
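A minimal Python sketch of reading the compartment from the attribute rather than from the id is shown below. It uses only the standard library; the snippet omits the SBML namespace and all other required attributes for brevity, and the species ids are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment; a real SBML file declares the SBML namespace.
SBML_SNIPPET = """
<listOfSpecies>
  <species id="M_atp_c" compartment="c"/>
  <species id="M_atp_m" compartment="m"/>
  <species id="M_ATP" compartment="c"/>
</listOfSpecies>
"""


def species_compartments(xml_text):
    """Map each species id to the value of its compartment attribute."""
    root = ET.fromstring(xml_text)
    return {sp.get("id"): sp.get("compartment") for sp in root.iter("species")}


print(species_compartments(SBML_SNIPPET))
```

Note that M_ATP carries no compartment hint in its id at all, yet its location is still unambiguous because it comes from the attribute.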

draeger commented 6 years ago

And about the reactions: we have requested specific SBO terms to express that a reaction is an

So, instead of relying on the id attribute having some prefix or infix EX_ or DM_ etc, you can just make use of the sboTerm attribute and specify exactly what is meant. Here is an example:

<reaction id="R_EX_ac_e" ... metaid="R_EX_ac_e" name="Acetate exchange" sboTerm="SBO:0000627">

As you see, you can still have your preferred prefix R_EX_, but there is a clear definition linked to the object via the SBO term attribute sboTerm="SBO:0000627"
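To illustrate, here is a small Python sketch that selects reactions by their sboTerm attribute instead of by an EX_ prefix. Only SBO:0000627 is taken from the example above; the other reaction ids are invented, and the fragment again leaves out the SBML namespace for brevity.

```python
import xml.etree.ElementTree as ET

EXCHANGE_SBO = "SBO:0000627"  # the term used in the example above

# Simplified, namespace-free fragment with made-up reaction ids.
REACTIONS = """
<listOfReactions>
  <reaction id="R_EX_ac_e" name="Acetate exchange" sboTerm="SBO:0000627"/>
  <reaction id="R_internal_1" name="Some internal reaction"/>
  <reaction id="uptake_17" name="Another exchange" sboTerm="SBO:0000627"/>
</listOfReactions>
"""


def reactions_with_sbo(xml_text, sbo_term):
    """Return the ids of all reactions annotated with the given SBO term."""
    root = ET.fromstring(xml_text)
    return [r.get("id") for r in root.iter("reaction") if r.get("sboTerm") == sbo_term]


print(reactions_with_sbo(REACTIONS, EXCHANGE_SBO))
```

The reaction uptake_17 is found even though its id follows no naming convention, which is the point being made about not relying on prefixes.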

matthiaskoenig commented 6 years ago

Yes, the ids for the same metabolite in different compartments have to be unique, but how one makes them unique is unimportant to me. If possible there should be a guideline, but no implementation should depend on it, because this breaks down at the latest when you couple models and have tissue, multiscale and community models. There you could have 20 submodels of identical cells in a tissue with some microorganisms, everything coupled, so in a multitude of submodels you have atp_c (cytosolic ATP). So in the overall model, should it be atp_c_ecoli_colon_cell2000 and atp_c_human_colon_cell5? This just does not work, i.e. no biological concepts like compartments in the identifiers, but of course you have the biological concepts in your SBML models. In simple single-cell models like BiGG models the guideline could be M_atp_c to create a localized id for the metabolite, but the implementation must use the compartment information, not infer things from the id! And this will break down for coupled models, communities, ...


tpfau commented 6 years ago

@draeger as for BiGG: I didn't notice that they specified compartments to be lower case and tissues to be upper case. So ok, that works.

One thing about compartments/tissues: If I'm not mistaken, SBML would expect one compartment for each compartment in each tissue. Correct?

As for IDs: My problem with MetaIDs is that I still don't really get which characters are allowed in there and which are not, as the description isn't very clear and points on to XML definitions etc. If I get this right, [] would still not be allowed in a metaID, along with a lot of other characters (e.g. + in a potential "fe+2" for iron 2+). Essentially the restrictions are very much the same as for the normal ID, except for some obscure "modification" characters (accents, etc). So it would still be: MetaID = ID with all disallowed chars replaced by __Ascii(char)__, with the difference that no prefix is needed.

tpfau commented 6 years ago

And for the Toolbox: It is essentially a question of how we export things to SBML in a way that read(write(model)) == model, or at least as close to model as at all possible, and similarly write(read(ModelFile)) should be as close to ModelFile as possible.

draeger commented 6 years ago

@tpfau, you can find several patterns for characters within SBML Strings defined in the JSBML class SyntaxChecker: https://github.com/sbmlteam/jsbml/blob/master/core/src/org/sbml/jsbml/validator/SyntaxChecker.java

Yes, reading and writing the model should yield similar (or at best even identical) results. The idea of human-readable id's is nice, however it introduces several difficulties, including mapping problems and potentially conflicting information with dedicated attribute values. So, I would not only rely on that.

If there is something in the id that can yet not be represented in SBML, we should find a way to have a specific attribute or annotation etc. for it, so that the id does not have to be semantically overloaded anymore.

mhucka commented 6 years ago

Joining the discussion a bit late, but I wanted to clarify and comment about a few things:

More generally, I find that a lot of questions about identifier syntax are driven by people's unspoken assumptions of what an "identifier" is in a particular context or application. For instance, we sometimes think of identifiers as being some kind of immutable quality of a thing. But they are not; they are just labels. My social security number is an identifier representing me, but it's not the only identifier for me; it's used by some software system somewhere that performs computations, but the number itself is neither me, nor the database record representing me, nor the only information attached to the database record. SBML identifiers are a bit like social security numbers for model entities, and the model entities are a bit like the database records in social security databases, except even less so, because two separate but conceptually identical SBML models might use entirely different identifiers for the conceptually-identical things.

With that in mind, here is how I would answer some specific questions asked upthread:

Finally, I confess I didn't read everything in this long issue discussion; hopefully this addressed some dangling "why" questions, but please let me know if not.

tpfau commented 6 years ago

@draeger Thanks for the link. My problem was mainly that I wasn't sure which characters are encoded by the Unicode codes (and I did not initially want to look them up). About your last comment: The "name" attribute is that place, but that is used for longer names. We essentially have the problem that there was no initial restriction on how IDs in the Toolbox are allowed to look, and we can't really change that now. I still think that "Prefix_ConvertedID" is a very straightforward way to do this. Yes, these IDs are NOT human readable, but they can easily be converted upon IO.

@mhucka Well, backward compatibility is one of our issues here, and (unfortunately) most CBM models do and did use the ID field as an "abbreviation" storage and name to store the full name. So we have loads of models out there where the IDs do have meaning. Along those lines: is there an easy way to store an additional label on an SBML entity, or an abbreviation? If we would like to translate the current setup to something less ID dependent, we would have:

ID -> someNonInterpretableID (e.g. entity1)
Name -> The shorthand we currently use in COBRA models as "id" (e.g. atp[c])
? -> The full name currently stored in COBRA models (e.g. adenosine triphosphate)

What could this ? be? It can't be an annotation, as this is an individual choice of the creator of the model, and thus not necessarily present in any database. Notes is discouraged, as this is not information that machines should be interpreting. For fbc:GeneProducts there is a "label" property which could be used, but for other SBML entities?

tpfau commented 6 years ago

I'll try to summarize the current situation: We have fields in the programming language model structure (model.mets; .rxns; .genes; .comps in matlab, id variables of objects in cobraPy). These ids are not necessarily usable as SBML IDs. Also, different fields can potentially contain the same id (a mets entry could be the same as a genes entry). While this is unlikely, it is possible. We need to convert these to SBML IDs in a way that is as unique as possible and reversible. We can't just use the name attribute, as this will make most old models invalid (name used to be a descriptive term, which used to be the same for all compartments, i.e. multiple metabolites would have the same model ID -> invalid).

We have the issue of backward compatibility: Old models (and also BiGG models) converted IDs such that compartments were translated from met[x] to met_x. This should be considered while reading models, as old code might rely on this. E.g. if we don't consider this, our new metabolite might look like met_x[x], or in cobraPy have the id met_x and the compartment x while someone would look for met.

Further "complications": Novel multi-tissue models or multi-species models. The latter are essentially the same as the former from a computational perspective: they have multiple "supercompartments" (tissues/species) surrounding normal compartments (compartments as used in the COBRA context, i.e. separated cellular compartments), connected by a connecting compartment (blood/external medium). In SBML these will have to be encoded as individual Compartments, but if we want to distinguish them, we will have to create some way to annotate this. The groups plugin might be useful here (grouping multiple compartments into one tissue/species).

My personal opinion: For backward compatibility I would not like to use anything but the ID field to export/import IDs. I would still trim trailing _x identifiers that match the corresponding compartment identifier. I would, if it is not present, add the corresponding compartment to the ID of a metabolite.
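The trimming rule from the previous paragraph might look roughly like this in Python. The function name is hypothetical, and the compartment is taken from the species' compartment attribute rather than guessed from the id.

```python
def to_toolbox_met_id(sbml_base_id, compartment):
    """Build a met[compartment] id, trimming a trailing _<compartment> if present."""
    bracketed = f"[{compartment}]"
    if sbml_base_id.endswith(bracketed):
        return sbml_base_id  # already in Toolbox form
    suffix = f"_{compartment}"
    # Only trim when the suffix matches the declared compartment exactly.
    if sbml_base_id.endswith(suffix) and len(sbml_base_id) > len(suffix):
        sbml_base_id = sbml_base_id[: -len(suffix)]
    return sbml_base_id + bracketed


for base_id, comp in [("atp_c", "c"), ("atp", "m"), ("glc_D_e", "e"), ("atp_c", "m")]:
    print(base_id, comp, "->", to_toolbox_met_id(base_id, comp))
```

The last case shows the limit of the heuristic: when the suffix does not match the declared compartment, nothing is trimmed, and a model in which a trailing _x was never meant as a compartment marker would be misread.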

Output from a Toolbox Model: I would define a clear conversion scheme from model ID (for toolbox metabolites that would mean including the compartment) to SBMLID (as detailed above). I would add a prefix to all entity classes (R, M etc.).

Input to the Toolbox: I would remove prefixes IF they are present on all entities of a given type. I would convert the IDs back to Toolbox IDs. I would add compartment IDs if necessary (i.e. if no compartment ID exists in the ID).
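The "remove prefixes IF they are present on all entities of a given type" rule could be sketched as follows in Python. The function name and the example ids are hypothetical.

```python
def strip_type_prefix(ids, prefix):
    """Remove the prefix only when every id of this entity type carries it."""
    if ids and all(i.startswith(prefix) and len(i) > len(prefix) for i in ids):
        return [i[len(prefix):] for i in ids]
    # At least one id lacks the prefix: assume it is part of the real ids.
    return list(ids)


print(strip_type_prefix(["M_atp_c", "M_adp_c"], "M_"))
print(strip_type_prefix(["M_atp_c", "adp_c"], "M_"))
```

The all-or-nothing check is what protects ids that merely happen to start with the same letters as a prefix.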

Tissues: I would assign them as part of the compartment, essentially a compartment ID (without prefix) being any string of the form: [^\[\]_]+(_[^\[\]_]*)? (i.e. a compartment id can be anything that contains no [ or ] bracket and no underscore, except for one underscore that separates compartment and tissue/species information).
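The pattern above can be tried out directly with Python's re module. The helper name and the example ids are made up for illustration.

```python
import re

# The pattern proposed above: compartment part, then an optional _tissue part.
COMPARTMENT_ID = re.compile(r"[^\[\]_]+(_[^\[\]_]*)?")


def split_compartment_id(comp_id):
    """Return (compartment, tissue or None), or None if the id does not match."""
    if COMPARTMENT_ID.fullmatch(comp_id) is None:
        return None
    compartment, _, tissue = comp_id.partition("_")
    return compartment, (tissue or None)


for candidate in ["c", "c_Liver", "c_Liver_x", "c[1]"]:
    print(candidate, "->", split_compartment_id(candidate))
```

Ids with a second underscore or with brackets are rejected, which is what keeps the split into compartment and tissue unambiguous.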

rmtfleming commented 6 years ago

Hi All,

perhaps I should qualify my earlier question about constraints on allowable characters. Why is it that [ ] are deemed to be incompatible with the characters in an identifier?

Also, it seems inconsistent with the chemoinformatics community to claim that a (canonical) InChI is not an identifier....

"What is an InChI? InChI is an acronym for IUPAC International Chemical Identifier. It is a string of characters capable of uniquely representing a chemical substance and serving as its unique digital ‘signature’. It is derived solely from a structural representation of that substance in a way designed to be independent of the way that the structure was drawn. A single compound will always produce the same identifier. In one sentence: InChI provides a precise, robust, IUPAC approved structure-derived tag for a chemical substance. http://www.inchi-trust.org/technical-faq/#2.1"

Are all of the characters feasible in an InChI also compatible with the characters allowed in an SBML identifier? If not, why not?

A more widespread adoption of database independent identifiers, e.g., InChI and RInChI (http://www-rinchi.ch.cam.ac.uk/), would help to avoid the proliferation of database-specific identifiers. It is tiring to see ever more myOwnDatabaseID's being used everywhere, especially for metabolic models, where there are database independent identifiers.

Regards,

Ronan


tpfau commented 6 years ago

Are all of the characters feasible in an InChI also compatible with the characters allowed in an SBML identifier?

No. SBML IDs have to fulfil the regexp (_|[a-zA-Z])[0-9a-zA-Z_]*. As to why: I assume it is to allow them to be used as variable names in as many programming languages as possible, but I'm not sure.
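The regexp quoted above can be checked against a few candidate ids with a short Python sketch. The helper name is hypothetical; the InChI string used is the one for water.

```python
import re

# The SBML id pattern quoted above.
SID = re.compile(r"(_|[a-zA-Z])[0-9a-zA-Z_]*")


def is_valid_sid(candidate):
    """True if the whole string matches the SBML id pattern."""
    return SID.fullmatch(candidate) is not None


for candidate in ["M_atp_c", "atp[c]", "1abc", "InChI=1S/H2O/h1H2"]:
    print(candidate, "->", is_valid_sid(candidate))
```

Only the first candidate passes: brackets, a leading digit, and the = and / characters of an InChI all fall outside the allowed set.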

mhucka commented 6 years ago

Hello,

Sorry for not responding sooner. Looking back at @tpfau's comments of 9 days ago and working forward from there:

I'll have to come back and explain the InChI matter later.

rmtfleming commented 6 years ago

Hi Mike,

could it be that the characters currently allowed in an SBML identifier are unnecessarily overconstrained?

In the section "3.1.7 Type SId" of the document "SBML Level 3 Version 2 Core, Release 1 (Release Candidate 2), 4 November 2016" http://sbml.org/Special/specifications/sbml-level-3/version-2/core/release-1-rc2/sbml-level-3-version-2-core.pdf one can read the following:

"SId is a data type derived from the basic XML type string, but with restrictions about the characters permitted and the sequences in which those characters may appear."

The explanation given in the following paragraph is sufficient to see why "Type SId is purposefully not derived from the XML ID type." However, that is an argument about the way an SId vs an XML ID is to be interpreted, rather than what characters each can contain.

However, there is no explanation after the statement that "SId does not include Unicode character codes; the identifiers are plain text." What is the explanation?

If the XML specification of an identifier allows Unicode character codes, it would seem like, in principle, compatibility between programming languages can still be satisfied by XML while allowing identifiers to contain unicode characters, including: [ ]. https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char

I understand it might be over the top to allow any combination of Unicode characters for an SId. Consider the set of printable ASCII characters, available in matlab using:

ascii = char(reshape(32:127,32,3)')

ascii =

  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

I can see from this list that it would not make sense to allow " to be a character inside an SId as " is used in the SBML file to delimit the set of characters allowed to make up an id.

However, I cannot see why [ or ] cannot be allowed, or why it is inconsistent with operability by various different programming languages. Specifically, which programming languages would have a difficulty with an id of the form "x[y]" ?

Just for the sake of illustration, one gets the following from matlab:

tmp = 'x[y]'

tmp =

x[y]

x[y]
 x[y]
  ↑
Error: Unbalanced or unexpected parenthesis or bracket.

Regards,

Ronan

On 1 November 2017 at 06:31, Mike Hucka notifications@github.com wrote:

Hello,

Sorry for not responding sooner. Looking back at @tpfau's comments of 9 days ago and working forward from there:

- "... backward compatibility is one of our issues here, and (unfortunately) most CBM models do and did use the ID field as an "abbreviation" storage and name to store the full name": I can certainly understand the difficulties of maintaining backward compatibility, and empathize :-).

- "there an easy way to store an additional label on a SBML entity, or an abbreviation?": since this would be application-specific (or community-specific I suppose), this is what the machine-readable <annotation> element is intended to support. You could invent a simple annotation syntax and store labels and additional information in there. What goes in <annotation> is largely up to users/developers, so you wouldn't be constrained about identifier/label syntax (well, apart from some constraints imposed by XML and UTF-8, but I wager they'd have minimal impact on this situation).

- "It can't be an annotation, as this is an individual choice of the creator of the model, and thus not necessarily present in any database": I'm not sure why it couldn't be an annotation, if by annotation you mean the <annotation> element in SBML. An annotation can be the choice of a creator of a model. The "not necessarily present in any database" would seem to be irrelevant in this situation, unless I'm missing something (which is entirely possible). If you can elaborate further, I can try to help figure this out.

- "We can't just use the name attribute, as this will make most old models invalid (name used to be a descriptive term, which used to be the same for all compartments, i.e. multiple metabolites would have the same model ID -> invalid).": that's unfortunate. It would have been an easy solution. (I feel I have to point out that it should be possible to put a version number on your format, and you could use different conventions in a new format and software could still recognize the different assumptions by checking the version number. However, it's true that this might work for software but would probably lead to a lot of confusion for humans.)

- "For backward compatibility I would not like to use anything but the ID field to export/import IDs. I would still trim trailing _x identifiers that match the corresponding compartment identifier.": It's not ideal, but doable, as long as it's understood that there are risks in cases where someone's model did not follow the assumptions about what y_x means.

- (The procedure for what to do with input & output to the Toolbox): I didn't read this in detail yet but I'll point the team to this and ask if anyone has ideas.

- "Why is it that [ ] are deemed to be incompatible with the characters in an identifier?": part of the reason why SBML id's have such simple syntax (at least in the early days) is to make it easier for people to write programs in scripting languages. In many of those languages, identifier syntaxes are limited and characters such as [ and ] have specific meaning. For example, imagine the result of writing x[y] in Matlab. Again, the identifier syntax was not meant to allow the expression of full entity descriptive names -- that's why there is also a name field.

I'll have to come back and explain the InChI matter later.


draeger commented 6 years ago

My understanding is that what the COBRA community preferably needs is a way to have both, a descriptive name and a shorthand label for SBML components, rather than an actual change of the allowed ID characters.

I see two possible ways to solve this most easily:

  1. You can use annotation as @mhucka suggested and define key-value pairs for a label there.
  2. You could define a cobra package for SBML.

For the first approach, you need to define a cobra namespace and then have just one key-value pair that would appear within the annotation element of SBML components where you need it.

The second approach would allow you to have a direct additional attribute cobra:label for all SBML elements. All you need to do is to define an extension of the abstract SBML element SBase and the label would be accessible everywhere.
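To make the first approach more concrete, here is a hedged Python sketch that stores a shorthand label as a single key-value pair inside an annotation element and reads it back. The namespace URI and the element name label are purely hypothetical, since no cobra namespace has been defined in this discussion, and the SBML core namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

COBRA_NS = "http://example.org/cobra/annotation"  # hypothetical namespace URI


def add_label(element, label):
    """Attach <cobra:label> inside the element's annotation, creating it if needed."""
    annotation = element.find("annotation")
    if annotation is None:
        annotation = ET.SubElement(element, "annotation")
    entry = ET.SubElement(annotation, f"{{{COBRA_NS}}}label")
    entry.text = label
    return element


def read_label(element):
    """Return the stored label, or None if the element has none."""
    entry = element.find(f"annotation/{{{COBRA_NS}}}label")
    return None if entry is None else entry.text


species = ET.Element("species", {"id": "entity1", "compartment": "c"})
add_label(species, "atp[c]")
print(ET.tostring(species, encoding="unicode"))
print(read_label(species))
```

Because the label lives in the annotation, it can contain characters such as [ and ] that are not allowed in the id itself.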

While it seems straightforward, it is not easy to just change the definition of SBML ids for the same reason as you describe for COBRA: backward compatibility. Imagine SBML Level 3 Version 2 came with new allowed characters for identifiers, then to benefit from this COBRA developers would also need to implement support for other features introduced with this new format, which has some strings attached. New rules for ids would even have an impact on other tool developers and would, therefore, involve a discussion with further modeling communities. Hence, it can be a complicated process. Maybe we should instead search for a solution that works for you using available methods that we have already.

Besides tool-specific annotation within a cobra namespace or a new extension package, it would also be possible to define a kind of a model glossary file that has additional information for each metaid in a model and could have your short-hand label and arbitrary names attached. Such an approach is currently being developed and supported by @matthiaskoenig. A glossary file could even open the door to localization support for descriptive names depending on user settings, etc. However, it would require you to load and write more than one file, unless you move to COMBINE archive as import/export format that is a ZIP archive with a manifest file and can have multiple files in it, including a model glossary.

To move forward, let's think about which of the two approaches above (or the glossary file approach) could be most beneficial for you and I'd be more than happy to assist you with next steps.

rmtfleming commented 6 years ago

Hi Andreas,

thanks for the alternate propositions. However, I think we are reaching a fundamental point here. On the one hand, we are being told the way we define IDs in a COBRA model is not compatible with an SBML ID, yet at the same time an InChI is not compatible with an SBML ID either, and the SBML ids are constrained to a restricted subset of the characters allowed by an XML ID. I would prefer a logical explanation for the current status before deciding how to move forward.

Does anyone know why the SId was defined as a very restricted subset of characters?

Regards,

Ronan

draeger commented 6 years ago

@rmtfleming: I might be wrong because I joined SBML development when it was already defined this way. My understanding is that early on developers desired to be able to directly convert SBML files to some source code, where it would be helpful if identifiers could be directly used as variable names. In this sense, square brackets would have caused problems back then. Once this decision was made, it was kept like this. @mhucka may be able to explain this better.

For now, I'd like to search for a solution that works for you and allows you to have short-labels together with more descriptive display names for all relevant SBML objects.

skeating commented 6 years ago

Indeed, initially SBML was seen literally as a means to exchange a model between tools that then created their own code/encoding/structure etc. in order to perform whatever analysis they wanted. This was 17 years ago, and so to facilitate use by as many existing tools as possible, the syntax was kept very narrow and allowed no characters that may have 'meaning' in any sort of code.

Backwards compatibility has always been a very strong component of SBML development and so the syntax was never changed. Changing it now would mean that any new models using a new syntax would be unreadable by hundreds of existing tools.

Also there is no requirement that a software maintains the ids of a model it reads. I know of several tools that instantly change the ids to ones more suited to their tool and then export the 'same' model with changed ids. The ids are merely to allow software to identify elements. So a model using an id that was intended to convey some sort of additional information would round trip through some software with that information totally lost.

@tpfau is correct in that he can read models and create COBRA appropriate ids and then change them back if exporting SBML. Both @mhucka and @draeger suggest solutions that would allow this information to be preserved and roundtripped. The real difficulty is in models that already attach meaning to an id - whether by _ or [] but do not use any new scheme produced. But that would equally apply if we did change the allowable syntax of an id. You would still need some way of knowing whether m[c] or m_c meant something about m and c or whether the user had just used m[c] or m_c as an identifier.

mhucka commented 6 years ago

@rmtfleming In this comment:

However, I cannot see why [ or ] cannot be allowed, or why it is inconsistent with operability by various different programming languages. Specifically, which programming languages would have a difficulty with an id of the form "x[y]" ?

It is the difference between typing "x[y]" and x[y]. One is a string. Some tools expose identifiers directly. So for instance, they might allow a user to type

a + b -> c

They don't allow you to type

a[x] + b[x] -> c

or if they do, it has a different meaning than interpreting the sequence of characters a[x] as a symbol with the characters a [ x ]. Indeed, in your own example with Matlab, you demonstrated the difference between a variable having a string value, and the interpretation of the same character sequence as a symbol. Try typing x[y] in other programming languages – not "x[y]", but x[y]. The result is likely to be different.
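The difference between the string and the symbol can be checked quickly in Python, which is used here only as an illustrative language (the thread does not name one). A minimal sketch: the parser reads `a[x]` as an indexing operation on `a`, whereas `a_x` is a single name. The helper `parsed_node_type` is a hypothetical name introduced for this example.

```python
import ast


def parsed_node_type(text: str) -> str:
    """Return the AST node type Python assigns to `text` parsed as an expression."""
    return type(ast.parse(text, mode="eval").body).__name__


# A bare name is one symbol; the bracketed form is an index operation.
print(parsed_node_type("a_x"))   # Name
print(parsed_node_type("a[x]"))  # Subscript

# The character sequence a[x] is not usable as a single identifier.
print("a[x]".isidentifier())     # False
```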

This design goal (making it easy for systems that expose id values as the symbols or objects to be accessed by users) is also the reason why the characters for identifiers are limited to plain text and not Unicode. This is the answer to your question:

"SId does not include Unicode character codes; the identifiers are plain text." What is the explanation?

(It's worth noting that many systems do not allow unicode characters in symbol names even today, although it's true that this has changed a lot in the last decade.)

On the topic of Unicode: the XML specifications did not introduce the use of Unicode until later, after the original id syntax was defined in SBML L1v1. The XML specification you linked to is from 2004 – some years after that. We had used the 2001 version, and unfortunately, once that was done, we couldn't update the definition without breaking backward compatibility.

All of the above is to explain why identifiers are the way they are. These are not arguments that this was a great design choice when viewed with 20/20 hindsight. They were the best we could do at the time when faced with the constraints and technologies that existed 17 years ago, but faced with the same problem today, we would probably make different choices. (Nonetheless, I think it would be fair to point out that they served a lot of people and projects well enough for a long time, which is pretty darn good, and an indication that the design suited a lot of people.) It is also important to keep in mind that identifiers are meant to be taken together with the name field, and annotations.

mhucka commented 6 years ago

Regarding InChI's as identifiers:

Also, it seems inconsistent with the chemoinformatics community to claim that a (canonical) InChI is not an identifier....

This was in response to my comment:

InChI should not be used as identifiers in any model. InChI is meta information about an entity in a model; it is not an identifier for something. InChI strings should be stored separately.

I see that what I wrote was not expressed very well. Yes, a given InChI string is an identifier ... for a chemical entity. The identifier of an entity in SBML is an identifier for an element in a model. This model entity may indeed represent a pool of chemicals that can be identified by a given InChI string; however, the model entity is conceptually a different thing. It's a little bit like an instance of a structure in C, or a row in an SQL database. You would probably not use the InChI string as the identifier of the instance or the row; rather, you would have a field or column named something like "inchi" where you would store the InChI string, as well as have other fields or columns to store other data about that model entity.

So all I'm trying to say is that you wouldn't use the InChI string as the primary identifier of the model entity. You would instead write the InChI as a property of the model entity. That's all I meant by "InChI should not be used as identifiers in any model" – it was really meant specifically about model identifiers, although I didn't make that clear in my comment.

A natural question to consider in this context is, what's wrong with using an InChI string as a species id? To answer that, consider the following scenario. Suppose a model has two compartments (let's call them c1 and c2), and both compartments have some quantity of glucose. The model will need two species definitions: one to represent the pool of glucose in c1 and another for glucose in c2. Now, what will happen if we use the glucose InChI string as the id of both species?
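The scenario can be sketched with plain Python dictionaries standing in for species definitions. This is only an illustration: the record layout, the ids `glc_c1` and `glc_c2`, and the helper `duplicate_ids` are assumptions, not part of any SBML library. Because ids must be unique within a model, using the same InChI string as the id of both pools produces a collision, while storing it as a property does not.

```python
from collections import Counter

# An InChI string for glucose, stored here as ordinary data.
GLUCOSE_INCHI = (
    "InChI=1S/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/"
    "h1,3-6,8-12H,2H2/t3-,4+,5+,6+/m0/s1"
)


def duplicate_ids(species):
    """Return the ids that occur more than once in a list of species records."""
    counts = Counter(record["id"] for record in species)
    return sorted(i for i, n in counts.items() if n > 1)


# InChI used as the id: the two glucose pools can no longer be told apart.
by_inchi = [
    {"id": GLUCOSE_INCHI, "compartment": "c1"},
    {"id": GLUCOSE_INCHI, "compartment": "c2"},
]

# InChI stored as a property: each pool keeps its own id.
by_property = [
    {"id": "glc_c1", "compartment": "c1", "inchi": GLUCOSE_INCHI},
    {"id": "glc_c2", "compartment": "c2", "inchi": GLUCOSE_INCHI},
]

print(len(duplicate_ids(by_inchi)))  # 1
print(duplicate_ids(by_property))    # []
```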

tpfau commented 6 years ago

My point of view: I don't care where I store the ID, but I need to provide backwards compatibility. Thus, the Toolbox will (likely always) interpret species identifiers with _x as part of the compartment identifier, that should be removed on read. This will NOT be used to set the compartment, but it will be used to For output (in the future), I think an annotation would be good. But I'm currently not sure how best to define this.

"It can't be an annotation, as this is an individual choice of the creator of the model, and thus not necessarily present in any database": I'm not sure why it couldn't be an annotation, if by annotation you mean the annotation element in SBML. An annotation can be the choice of a creator of a model. The "not necessarily present in any database" would seem to be irrelevant in this situation, unless I'm missing something (which is entirely possible). If you can elaborate further, I can try to help figure this out.

Currently we use MIRIAM annotations and the CVTerms extracted from an annotation. Admittedly I'm far from being an XML expert, which is why I said the above, but I get that this was incorrect. If I get this right, we would need to define an XML namespace and an appropriate schema somewhere, which we would then need to add to the SBML XML file. In this schema, we could declare e.g. a simpleType for the COBRA-ID, with a string element id, an additional one for Confidence Scores, and further ones as required. This would then be added to the annotation string, and we would also need to parse it again when reading (at least in the Matlab interfaces). What I'm not entirely sure about is whether we could simply concatenate the annotation parts, or how this needs to be done to not mix things in the output, but I guess I can figure it out.
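A rough sketch of what writing and reading such a namespaced annotation could look like, using only Python's standard `xml.etree.ElementTree`. Everything specific here is an assumption for illustration: the namespace URI, the element names `info`, `id` and `confidenceScore`, and the helper functions are hypothetical and no agreed schema exists. The `annotation` element is also left without the SBML namespace to keep the example short.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI and prefix, for illustration only.
COBRA_NS = "http://www.cobra.org/schema-1.0/"
ET.register_namespace("cobra", COBRA_NS)


def build_annotation(cobra_id: str, confidence: int) -> ET.Element:
    """Build an annotation element holding a COBRA id and a confidence score."""
    annotation = ET.Element("annotation")
    info = ET.SubElement(annotation, f"{{{COBRA_NS}}}info")
    ET.SubElement(info, f"{{{COBRA_NS}}}id").text = cobra_id
    ET.SubElement(info, f"{{{COBRA_NS}}}confidenceScore").text = str(confidence)
    return annotation


def read_cobra_id(annotation: ET.Element):
    """Return the COBRA id stored in the annotation, or None if it is absent."""
    node = annotation.find(f"{{{COBRA_NS}}}info/{{{COBRA_NS}}}id")
    return None if node is None else node.text


xml_text = ET.tostring(build_annotation("atp[c]", 4), encoding="unicode")
print(xml_text)
print(read_cobra_id(ET.fromstring(xml_text)))  # atp[c]
```

Because the original id travels as element text rather than as an SId, characters such as `[` and `]` need no conversion.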

This being said: This modification would be a drastic change in the output format, and tools which currently expect a specific format could have a difficult time until updated (if they are still maintained).

rmtfleming commented 6 years ago

Hi Mike,

I agree that using all Unicode characters in an ID would be going too far and is not necessary.

Still, it seems that the character constraints on SId definition are too restrictive:

letter ::= 'a'..'z','A'..'Z'

digit ::= '0'..'9'

idChar ::= letter | digit | '_'

SId ::= ( letter | '_' ) idChar*
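For reference, the grammar above translates directly into a regular expression. The following is a minimal Python sketch; the function name `is_valid_sid` and the sample ids are chosen for illustration only.

```python
import re

# SId ::= ( letter | '_' ) idChar*, with ASCII letters and digits only.
SID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_sid(candidate: str) -> bool:
    """Check whether the whole string matches the SId grammar."""
    return SID_PATTERN.fullmatch(candidate) is not None


for candidate in ["D_Glucose", "atp_c", "D-Glucose", "atp[c]", "10fthf"]:
    print(candidate, is_valid_sid(candidate))
```

The first two samples pass; the hyphen, the square brackets, and the leading digit make the other three fail.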

Compare this with the ascii characters:

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

The only ascii character there that I can see should not be part of an ID is " because that character is used to delimit the value of a particular SId instance. Once the " character is reserved as the delimiter, it is hard to see what programming language would not be able to interpret the other ascii characters between " " as a string. Even in matlab, one can put any ascii character between ' ' and it is interpreted as a string.

Why is it ok to have _ but not [ ] or - ? This seems arbitrary.

Another example would be why _ but not /, as appears within InChI?

To illustrate the sort of problem it creates to restrict the SId characters, consider the case of using D-Glucose as an instance of a primary key, ignoring the problem of compartments at first. This made end users in practice have to replace - with _ so we used D_Glucose as the identifier for the sbml file. However, with compartments, we had D-Glucose[c] and D-Glucose[m] needing to get translated back and forth to D_Glucosec and D_Glucosem. The problem here is that we have to introduce heuristics to know to translate D_Glucosec back into D-Glucose[c] rather than D[Glucose-c] or the like. This problem was caused by overly restricting the set of allowable characters in the SId in the first instance.
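The ambiguity can be made concrete with a small Python sketch. The function `naive_to_sid` is a hypothetical stand-in for the kind of lossy replacement described above (hyphen to underscore, brackets dropped), not the Toolbox's actual code, and the candidate ids are invented for illustration. Several distinct ids collapse onto the same SBML id, so the reverse mapping cannot be recovered without heuristics.

```python
def naive_to_sid(cobra_id: str) -> str:
    """Lossy mapping: turn '-' into '_' and drop square brackets."""
    return cobra_id.replace("-", "_").replace("[", "").replace("]", "")


# Distinct source ids (hypothetical) that all end up as the same SBML id.
candidates = ["D-Glucose[c]", "D_Glucose[c]", "D-Glucosec", "D_Glucosec"]

for cobra_id in candidates:
    print(cobra_id, "->", naive_to_sid(cobra_id))

# Only one distinct result remains for four different inputs.
print(len({naive_to_sid(c) for c in candidates}))  # 1
```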

When the FBC constraints package was introduced, everybody who wanted to support it had to update their io code for import and export of the new standard for forward compatibility. So neither the tools nor the standards should be set in stone and considered immutable. I very much doubt that the COBRA toolbox is the only tool that is having to translate back and forth between the current restricted SId character set and the set of ascii (less the " character).

The InChI in multiple compartments example that you provided can be flipped the other way: it would be fine to use the InChI as the first part of the SId, and the second part would then be some (ideally database independent) identifier for a compartment.

Regards,

Ronan

draeger commented 6 years ago

@tpfau, yes, your description is right. If you plan to write customized tool-specific stuff in a well-defined way in an SBML document, you need to create your namespace. The SBML specification of L3V1R2 gives an excellent example of how to do this on pages 15 to 16.

Both SBML libraries, libSBML and JSBML, provide convenience functions to access the content of the non-RDF annotations (non-MIRIAM part of the annotations, i.e., all customized annotations). So you won't have to parse XML code yourself. Instead, you will be provided with objects of type XMLNode that contain attributes and child nodes.

Defining your custom annotation isn't such an effort. You just need to specify some versioned URI, e.g., http://www.cobra.org/schema-1.0/. By including a version number in the namespace, you ensure that you can, later on, introduce changes more easily and tell the parser to expect different pieces of information. You do not necessarily have to own the real URL for the namespace, but it would be nice to use one that points to the location where the documentation about your additions is.

So, you basically should write a few paragraphs describing what your namespace is and which attributes and additions you define in there. This document will allow others to read and write this information as well.

tpfau commented 6 years ago

Let's summarize what we would need because we currently can't annotate it properly:

Apart from the confidence Scores and cobra-IDs, all of this is connected to constraint-based properties and should thus (in my opinion) be part of the fbc package. So we have 2 items which we could use annotations for. At the same time one of those will always be modified when read and adhering to a specific format, to retain backward compatibility (if our annotation is not set, we will use the SBML ID and we will modify that ID if it fits to specific schemata, as this happened in old models). So, there are now two options for the IDs:

  1. Introduce a cobra specific annotation and generate output ids in whatever fashion we like.
  2. Define a clear conversion mechanism from COBRA-ID to SBML-ID.

In both instances any user will have to adapt their system.
For 1, because unless their system knows which annotations to expect, the best it could do is generate a property for each element that combines the namespace and the property name (which still does not match the ID but only contains it). So one would need specific interpreters to use this information.
For 2, they would need to know the conversion method to be able to convert the IDs. Personally, I think that while creating an annotation would have the benefit of later extensibility, it comes with the drawback that things which should go into e.g. fbc will end up in the annotation for some time and only later be incorporated into fbc, leading to a lot of additional testing in everyone's code (whether the entry is there, or there...). This is something I really want to avoid.

So personally, I would still vote for a clear conversion scheme that everyone can follow on how to get from a COBRA model ID to an SBML ID. In any case, there will be some things that we still have to do to accommodate old models, but that's (at least for us) unavoidable.
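One possible shape for such a conversion scheme, sketched in Python and modelled on the convention described at the start of this thread (every character outside a-zA-Z0-9_ becomes `__Num__`, where Num is the character's integer code). The function names are hypothetical, and this is not the Toolbox's implementation. It omits the type prefix for ids that start with a digit, and the round trip assumes the original id does not already contain a `__<digits>__` sequence.

```python
import re


def cobra_to_sid(cobra_id: str) -> str:
    """Encode each character outside [A-Za-z0-9_] as __<code point>__."""
    return re.sub(
        r"[^A-Za-z0-9_]",
        lambda match: f"__{ord(match.group())}__",
        cobra_id,
    )


def sid_to_cobra(sid: str) -> str:
    """Decode every __<digits>__ marker back to the character it stands for."""
    return re.sub(r"__(\d+)__", lambda match: chr(int(match.group(1))), sid)


print(cobra_to_sid("atp[c]"))                      # atp__91__c__93__
print(sid_to_cobra(cobra_to_sid("D-Glucose[c]")))  # D-Glucose[c]
```

Unlike a plain character substitution, every encoded id maps back to exactly one source id under the stated assumption, so no heuristics are needed on import.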