Closed zakandrewking closed 1 year ago
This will be solved as part of #684
This is a JSON import bug, not an SBML related bug. When reading the json the list of lists must be converted in correct annotations.
I also had this issue when I was trying to parse universal_model to sbml which still shows annotations in the form of list of lists. But when I checked the json_schema for annotations, it specify annotations to be of type object :
"annotation": {"type": "object"},
And performing a validation check over this json using the corresponding json_schema do points the error :
from jsonschema import validate
validate(json.load(file_handle), json_schema)
Error :
Traceback (most recent call last): File "issue736.py", line 110, in
ans = validate(json.load(file_handle), json_schema) File "/home/hemant/environment/cobrapyenv/lib/python3.7/site-packages/jsonschema/validators.py", line 934, in validate raise error jsonschema.exceptions.ValidationError: [['KEGG Compound', 'http://identifiers.org/kegg.compound/C01468'], ['CHEBI', 'http://identifiers.org/chebi/CHEBI:11981']] is not of type 'object' Failed validating 'type' in schema['properties']['metabolites']['items']['properties']['annotation']: {'type': 'object'}
On instance['metabolites'][0]['annotation']: [['KEGG Compound', 'http://identifiers.org/kegg.compound/C01468'], ['CHEBI', 'http://identifiers.org/chebi/CHEBI:11981']]
So I think json_scheme is upto date but the error exist with the models where annotations are list of lists. Because other models like e_coli_core where annotations values are json objects passes the validation test. So @matthiaskoenig , can you please help me understand how you want this issue to be solved. Should there be any change in the json_schema to allow annotations to be list of lists and further in _model_fromdict() method of cobra.io.dict.py where the parsing is done? Thanks.
A JSON object is not the same as an object in Python but rather a something like a dict (https://restfulapi.net/json-objects/). @zakandrewking when you said that the JSON schema allows annotations as list which source did you refer to? The one currently implemented in cobrapy does not seem to allow it.
I see this in a similar light to you, Christian. To me what should possibly happen in a next version of the JSON schema is:
An example of an annotation is then:
"annotation": {
"bigg.reaction": [["is", "PFK26"]],
"kegg.reaction": [["is", "R02732"]],
"rhea": [["is", "15656"]]
}
This looks good to me and has the advantage that is already valid under the current schema which makes it backwards compatible.
A JSON object is not the same as an object in Python but rather a something like a dict (https://restfulapi.net/json-objects/). @zakandrewking when you said that the JSON schema allows annotations as list which source did you refer to? The one currently implemented in cobrapy does not seem to allow it.
I just tested the steps from my original posting, and they are still the case with COBRA 0.17.1. It's not that the schema is wrong but that COBRApy doesn't stop you from using it this way.
Oh I see so the problem is that loading it does not throw an error because it should not pass the schema.
That was also one thing that confused me. Though the json_schema is defined in cobra.core.json.py file, it is nowhere used as a validator to validate the model which we are loading from the external json file. The _load_jsonmodel(filename) method of cobra.core.json.py directly parse the json into dictionary without any validation.The above error which I caught came when I myself checked the json against the json_schema. If validation is done while parsing the model from the json file, the 'list of lists' error or any other possible error will be caught there only instead of it being rising at the time of making of sbml or any other kind of model.
I have some local code that is unfortunately not quite ready. It uses pydantic models to define and validate the schema. Because it is a little more work it is also delaying #720.
@Hemant27031999 yeah I saw the the same thing. I think this is due to schema validation being really slow and reading JSON is currently really fast. Also most validation is performed in io/dict.py
on the dict so it can be used for yaml as well. For the annotations one easy way could be to adjust the Object
class directly and add a setter for annotations that only allows dicts. That looks like the safest option as it would propagate to all possible ways to get a model (de novo, from SBML, JSON, etc).
Yeah, I think that can be the best thing to do. Not only it will stop the complete validation check on the json model using the schema that can make the process slow, it will even catch the error of "list of lists" while reading the model from the json file. Can I work on this issue and submit a PR for the same if you say?
Sure go ahead. The Object
class is implemented in core/object.py
. Would be great if you could also add a test as well. You can add me as a reviewer for the PR.
I am working on my proposal for GSoC 2020 where I have to add the complete support for annotation in JSON format which, right now, have many issues. I have gone through the SBML level 3, version 2 specification doc for annotation in section 6 and have figured out following three feasible data format in SBML-JSON form.
SBML Annotation :
<annotation>
<rdf:RDF …(other namespaces)... >
<rdf:Description rdf:about="#G_b0351">
<bqbiol:is>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/uniprot/P77580" />
<rdf:li rdf:resource="http://identifiers.org/uniprot/P0A6A3" />
</rdf:Bag>
</bqbiol:is>
<bqbiol:isEncodedBy>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/asap/ABE-0001207" />
<rdf:li rdf:resource="http://identifiers.org/ecogene/EG13625" />
<rdf:li rdf:resource="http://identifiers.org/ncbigene/945008" />
<rdf:li rdf:resource="http://identifiers.org/ncbigene/945008" />
</rdf:Bag>
</bqbiol:isEncodedBy>
</rdf:Description>
</rdf:RDF>
</annotation>
Format 1 :
annotation = {
“is” : [“uniprot/P77580”, “uniprot/P0A6A3”],
“isEncodedBy” : [“asap/ABE-0001207”, ”ecogene/EG13625”,
“ncbigene/945008”, “ncbigene/945008”]
}
Format 2 :
annotation = {
“uniprot” : [(“is”, “P77580”), (“is”, “P0A6A3”)],
“asap” : [(“isEncodedBy” ,“ABE-0001207”)],
“ecogene” : [(“isEncodedBy”, “EG13625”)],
“ncbigene” : [(“isEncodedBy”,“945008”), (“isEncodedBy”,”945008”)]
}
Format 3:
annotation = {
“uniprot” : {“is” : [“P77580” , “P0A6A3”]},
“asap” : {“isEncodedBy” : [“ABE-0001207”]},
“ecogene” : {“isEncodedBy” : [“EG13625”]},
“ncbigene” : {“isEncodedBy” : [“945008”, ”945008”]}
}
Out of the above three formats, format 2 is somewhat similar to one discussed above (suggested by @Midnighter ) in above discussion. This format is good, but it has a repetition of biological identifier which makes it a little bit lengthy. Format 1 has identifier as key and list of resources as value. It looks a little more compact and also follow structure similar to one followed in SBML itself i.e resources listed under biological qualifier name. Format 3 is also compact but it looks a little scattered. So I think format 1 will be most suitable out of the three. I thought it would be better if I once discuss this with the members of cobrapy. Please have a look at it and let me know if it is fine or not. Thanks.
Awesome, thank you for putting in the extra effort here @Hemant27031999. I have opted for format 2 in the pydantic models created in cobra-component-models. In particular, you can look at the definition of the class for annotations which is used in the common base class for species, compartments, and reactions.
I opted for this slightly more verbose option because it is closest to the current JSON schema.
I'm not a fan of format 1 at all because separating namespace and identifier will require a lot of string manipulation and it is the biggest change from the existing format.
In my case, I just get rid of annotations completely:
for reaction in model.reactions: reaction.annotation = {}
This has derailed a bit into discussing the new annotation format. The original issue has been fixed by enforcing the schema.
Problem description
The JSON format allows annotations as lists of lists, but trying to export such a model as SBML causes an error.
This is the case for the BiGG Models universal model (http://bigg.ucsd.edu/data_access). If that model is incorrectly formatted, I can fix it, but it might be good to also add some warning in COBRApy.
Might be related to #580. Originally reported as https://github.com/SBRG/bigg_models/issues/299.
(Thanks!!!!)
Code Sample
Model:
Script:
Actual Output
Expected Output
Valid SBML model.
Dependency Information
Please paste the output of
python -c "from cobra import show_versions; show_versions()"
in the details block below.