SBML -> Store Model IDs

tpfau commented 7 years ago

This is a spin of from #6, concerning only the SBML <-> COBRA ID conversion. Current status is two options: A: Define a clear conversion scheme between COBRA-ID and SBML ID B: Add the fbc label as an extension of SBase instead of only on GeneProducts.

If B does not exist in a model, A will be used (and due to old code, some modifications will happen for backward compatability).

tpfau commented 6 years ago

Since this is coming up again for us, I'm wondering, whether we, at least within the cobra projects can figure out a way how we store our ids independent of conversions. From what I figured in several discussions, we will need to keep a bunch of conversions in place when reading SBML (BioModels uses __DASH__ and similar things), bigg has the R_ / M_ markers and _c nomenclature for localisation etc, while the cobra toolbox has [c] as compartment id. So at least when reading models, we will have to potentially make some adjustments (or just read the data without any modification).

I'm not sure, when a new FBC scheme will be ready and I don't know whether a "label" field is actually useful for us. What I'm wondering is if we can set up an annotation that allows literally "anything" to be stored as id e.g.

<annotation>
<cobra:id>....Something "That Can [be] anything"..</cobra:id>
</annotation>

While I don't like this, it is the only way I can think of how we can handle "any" char...

rmtfleming commented 6 years ago

I think we need to define what characters are and are not going to be permitted, rather than allowing everything, which is too flexible. How about we start with the set of printable ASCII characters, available in matlab using: ascii = char(reshape(32:127,32,3)') ascii = ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ' a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

Then excluding certain special characters, such as [ ) |, or certain phrases 'and' 'or'. Once we have a proposition, we should try to get agreement across the cobra community.

draeger commented 6 years ago

This sounds very reasonable to me! As a side note, for this annotation to work, we will also need to define a unique namespace URI, for instance, https://github.com/opencobra/schema/sbml/v1 or so, where v1 denotes a version number so that it can be updated later on.

Midnighter commented 6 years ago

On the cobrapy side, if we stay with the current implementation it would also be very desirable to turn IDs into valid Python variable names. So they should follow a regex such as [a-zA-Z_][a-zA-Z0-9_]*. It's desirable for us because it facilitates quick access to the metabolite and reaction objects without causing a syntax error.

tpfau commented 6 years ago

And here we got our first major clash :smile: Cobra Toolbox ids commonly come with '[]' and '()' as the metabolite compartment is stored in the metabolite id. So we currently always have to convert our ID into something that is SBMLID compatible. But how does cobrapy handle the current ids? Is it using the SBML-IDs directly (which should be possible)?

With respect to restrictions: I'm fine with whatever choice is made there, but we have to ensure that we don't forbid existing names. Overall I personally think it is better to allow any ID and store it as a simple string than to start restricting, as this will likely lead to problems for someone. Yes, a restriction to printable ASCII characters makes sense, but other restrictions are likely to cause issues in the future. Also, if we don't stick to the SBMLID restrictions, we still have the problem of how to store this id, and thus of conversion, and the only reason this is problematic is because of conversions, so if we don't need to convert it, we don't need to restrict, and if we need to convert to something matching @Midnighter 's regexp, we have a valid SBML ID, but have to change a lot of things in the Toolbox :smile:

tpfau commented 6 years ago

Coming back to this: I personally don't think, that we need tough restrictions on the model IDs. The only real restrictions I can think of from the COBRA Toolbox perspective (at the moment): metabolite IDs are not allowed to have [] at any place outside the enclosing brackets for the compartment ID.

And even this restriction could be overcome if we switch from the implicit representation of compartments in the ID to an explicit representation of the compartment in a metComps field. We will also soon have to handle the question on how to represent multiple tissues in both SBML and cobra models. Sure, SBML just defines all compartments in different tissues as individual compartments, but we will need to have something that (apart from again storing information in IDs) clarifies which compartments form a group in which tissue. My suggestion would be to use the group extension to group the different compartments and, again store the individual compartment IDs using the COBRA ID annotations.

The benefit would be, that SBML IDs would finally become what they are supposed to: pure computational IDs without attached meaning, while all meaning would be given by annotations. To be able to handle older models, we will of course have to have some type of conversion function for old style IDs but that would be software side implementations which are not interfering with model exchange.
For python, this would essentially mean, that you could use the SBMLID as variable name, but have __str__ and __repr__ return the displayID.

opencobra / schema

SBML -> Store Model IDs #8