opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0
460 stars 216 forks source link

Special character substitution during SBML parsing can break IDs #936

Open forestrywhale opened 4 years ago

forestrywhale commented 4 years ago

Problem description

I'm trying to use the function cobra.io.sbml.read_sbml_model to load a sbml model that I generated using function cobra.io.write_sbml_model. The model passed the online SBML validator checking, though warnings exist for lacking assigned initial values to metabolites.

If I generate the model with cobra.io.write_sbml_model and then load the model with cobra.io.sbml.read_sbml_model, it will give me error msg as:

CobraSBMLError: Something went wrong reading the SBML model. Most likely the SBML model is not valid. Please check that your model is valid using the cobra.io.sbml.validate_sbml_model function or via the online validator at http://sbml.org/validator . (model, errors) = validate_sbml_model(filename) If the model is valid and cannot be read please open an issue at https://github.com/opencobra/cobrapy/issues .

And the cobra.io.sbml.validate_sbml_model give me results as following:

(None, {'SBML_FATAL': [], 'SBML_ERROR': [], 'SBML_SCHEMA_ERROR': [], 'SBML_WARNING': [], 'COBRA_FATAL': ['Names cannot contain whitespace characters. "M\x0545\n45__METHENYL-THF_c" contains whitespace character "\n".'], 'COBRA_ERROR': [], 'COBRA_WARNING': [], 'COBRA_CHECK': []})

However, if use cobra.io.save_json_model and then load with cobra.io.load_json_model, then there would be no error.

I'm not sure why this error happens, hope someone could help me with it.

Midnighter commented 4 years ago

JSON model writing and loading does not do the same kind of validation checks that SBML does. So I'm sorry if that has added confusion.

This part of the validation is quite explicit:

'COBRA_FATAL': ['Names cannot contain whitespace characters. "M\x0545\n45__METHENYL-THF_c" contains whitespace character "\n".'],

It looks also like there are some encoding issues at play? Can you say where you first got the model from?

forestrywhale commented 4 years ago

The original model was downloaded from Ecocyc database and have this header <?xml version="1.0" encoding="UTF-8"?>. I couldn't find anything exactly like "M\x0545\n45__METHENYL-THF_c", but the "METHENYL-THF" hit only 1 record, as following:

      <species id="M__5__45__10__45__METHENYL__45__THF_c" name="5,10-methenyltetrahydrofolate" compartment="c" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-1" fbc:chemicalFormula="C20H20N7O6">
        <notes>
          <html xmlns="http://www.w3.org/1999/xhtml">
            <p>BIOCYC: 5-10-METHENYL-THF</p>
            <p>INCHI: InChI=1S/C20H21N7O6/c21-20-24-16-15(18(31)25-20)27-9-26(8-12(27)7-22-16)11-3-1-10(2-4-11)17(30)23-13(19(32)33)5-6-14(28)29/h1-4,9,12-13H,5-8H2,(H6-,21,22,23,24,25,28,29,30,31,32,33)/p-1/t12-,13+/m1/s1</p>
            <p>CHEBI: 57455</p>
            <p>HMDB: HMDB01354</p>
            <p>PUBCHEM: 46878371</p>
            <p>CAS: 7444-29-3</p>
            <p>METABOLIGHTS: MTBLC57455</p>
            <p>KEGG: C00445</p>
            <p>FORMULA: C20H20N7O6</p>
            <p>CHARGE: -1</p>
          </html>
        </notes>
      </species>

Thank you for your response!

cdiener commented 4 years ago

Looks like a problem with capturing the special characters. SBML often tried to convert special symbols by __[ASCII code]__. So __45__ is a dash (-) for instance. You can see this in the parsed expression. The dash between "METHENYL" and "THF" got parsed correctly but the dash at the beginning of the ID did not. The target ID is probably BIOCYC: 5-10-METHENYL-THF. The problem is the leading __ which confuses the parser. I think this might also be an issue in Ecocyc since that looks like a malformed ID but we should handle this better.

forestrywhale commented 4 years ago

This make lots of sense, thank you! I'll go change those ids and see what happens.

forestrywhale commented 4 years ago

Yes! I changed the "__45__"to "_" in the original file, and the problem solved. It would load no problem. Thank you all!

cdiener commented 4 years ago

Awesome! We'll probably keep that open because SBML round tripping should always be possible or give an error during parsing.

matthiaskoenig commented 4 years ago

I will have a look at the issue.

forestrywhale commented 4 years ago

I also found that whenever I use the function cobra.io.write_sbml_model to write the model, it would write __45__or __44__ to the new .xml file, which are not in the input model, specifically in all groups, such as:

  <groups:group groups:id="G_TYRSYN__44____32__PHESYN__44____32__ALL__45__CHORISMATE__45__PWY__44____32__COMPLETE__45__ARO__45__PWY__44____32__PWY__45__6628" groups:name="TYRSYN, PHESYN, ALL-CHORISMATE-PWY, COMPLETE-ARO-PWY, PWY-6628" groups:kind="collection">
    <groups:listOfMembers>
      <groups:member groups:name="CHORISMATEMUT-RXN" groups:idRef="R_CHORISMATEMUT_RXN"/>
    </groups:listOfMembers>
  </groups:group>

So far I am using .json file instead, but I thought it might be a good idea to post it here.

cdiener commented 4 years ago

This is kind of expected. SBML has strict rules which characters can be in an id so it will replace invalid characters by __[ASCII code]__. So , becomes __44__ for instance.

forestrywhale commented 4 years ago

Ah, I see the problem. I'll do preprocessing converts the "-" into "_" for groups. Thank you!

cdiener commented 4 years ago

955 is related to that, so maybe one could fix both in one go.