Choice of Annotation Databases

ChristianLieven commented 6 years ago

Who defines the set of "common databases"? Why not take any database registered in the Identifiers.org registry? That way, we are certain the annotations will resolve. doi:10.1093/nar/gkr1097

Nicolas Le Novere

As a start, I have taken the liberty of defining a set of databases that I have seen used in many models. I invite you to make further suggestions, but I should add that I will be somewhat opposed to testing for the presence of annotations from all 619 collections within the registry. It would certainly make the runtime explode :) Let's try to find a small but common/broad selection of databases that we would like models to reference. The databases I picked are indeed all registered in Identifiers.org. Here is my selection for both metabolites and reactions: https://puu.sh/yxj5i/004f8bbc24.png I will make an effort to include this list more transparently in the report or in the documentation!

My response

The problem you will face is that none of those databases will cover everything we need. Some databases are more comprehensive (KEGG, BRENDA) and some are more accurate (Reactome, SABIO-RK). People also prefer different databases. For instance, people coming from the kinetic modeling side want(ed) KEGG, people coming from the FBA side want BiGG, and people who care about accuracy tend to favor ChEBI. Moreover, databases come and go. And finally, constraint-based modelling does not live in isolation from the rest of computational biology. There are higher authorities deciding which repositories to fund reliably (e.g. BD2K, ELIXIR), and that affects long-term maintenance and therefore usability. At the end of the day, it is just a question of mapping. In SBML, one can add any number of reactions. What we probably want for reconstruction is tables (database or spreadsheet) that contain the reactions, the chemicals and the genes. Those tables can be used for mapping between databases. This is just a technical issue, easy to solve. Give me ChEBI IDs, I'll give you KEGG, CAS or InChIs.

Nicolas Le Novere
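
As an illustration of the mapping-table idea above, here is a minimal Python sketch. The table layout, the column names and the helper `translate` are hypothetical, and the two rows (water and ATP) are only illustrative; a real table would be built from the databases' own cross-reference exports.

```python
# Hypothetical cross-reference table: one row per chemical, one column per
# database. The rows are illustrative only; a real table would be loaded
# from a spreadsheet or a database export.
XREF_ROWS = [
    {"chebi": "CHEBI:15377", "kegg.compound": "C00001", "cas": "7732-18-5"},
    {"chebi": "CHEBI:15422", "kegg.compound": "C00002", "cas": "56-65-5"},
]


def build_index(rows, source):
    """Index the table rows by the identifier found in the `source` column."""
    return {row[source]: row for row in rows if row.get(source)}


def translate(rows, source, target, identifiers):
    """Map identifiers from the `source` column to the `target` column.

    Identifiers that are missing from the table map to None.
    """
    index = build_index(rows, source)
    return {ident: index.get(ident, {}).get(target) for ident in identifiers}


kegg_ids = translate(XREF_ROWS, "chebi", "kegg.compound",
                     ["CHEBI:15377", "CHEBI:0"])
```

With such a table in place, supporting a further database amounts to adding one more column rather than changing any code.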

matthiaskoenig commented 6 years ago

Here is my opinion:

Only support open, public databases which are accessible without licensing fees (i.e. drop things like KEGG, BioCyc, HumanCyc).
For species, ChEBI is all you need: it has great crosslinks to all other databases, is well maintained, and has a very fast turnover of updates via a GitHub tracker, i.e. if something is missing or incorrect, a fixed version is available online within weeks.
Reactions are much more complicated. Personally I like RHEA very much, but it lacks a bit of coverage and especially integration with other databases via crosslinks. Reactome is also a very good choice, as is the BiGG database. With BiGG I am not sure what their coverage is beyond microorganisms. There is only a very old RECON1, so I am not sure whether all the reactions of RECON2 are in there, which is mainly what I would need for many applications.

Edit:

Proteins: UniProt
Genes: Ensembl

Midnighter commented 6 years ago

Basically, all we require from identifiers.org is the database identifier and the respective regular expressions for metabolites, genes, and reactions. This information could easily be read from a flat file. So I would go with a small selection of databases defined by us, plus any number of additional user-defined databases in the future. By that I mean that users can configure this information for their own test runs.
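
A minimal Python sketch of the flat-file approach described above. The file layout (entity type, namespace, pattern per line), the function names and the listed patterns are assumptions made for illustration; this is not memote's actual implementation, and the authoritative patterns would be copied from the Identifiers.org registry.

```python
import re

# Hypothetical flat file: entity type, database namespace, identifier pattern.
# The patterns are illustrative stand-ins for the ones in the registry.
DEFAULT_DATABASES = r"""
metabolites chebi ^CHEBI:\d+$
metabolites kegg.compound ^C\d+$
reactions rhea ^\d{5}$
"""


def load_patterns(text):
    """Parse the flat file into {entity type: {namespace: compiled regex}}."""
    patterns = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        kind, namespace, regex = line.split()
        patterns.setdefault(kind, {})[namespace] = re.compile(regex)
    return patterns


def check_annotation(patterns, kind, annotation):
    """Return {namespace: bool} for every configured namespace in `annotation`.

    Namespaces that are not configured for `kind` are skipped.
    """
    result = {}
    for namespace, identifier in annotation.items():
        regex = patterns.get(kind, {}).get(namespace)
        if regex is not None:
            result[namespace] = regex.match(identifier) is not None
    return result


patterns = load_patterns(DEFAULT_DATABASES)
report = check_annotation(
    patterns,
    "metabolites",
    {"chebi": "CHEBI:17234", "kegg.compound": "glucose", "other.db": "x"},
)
```

A user-defined database would then be one extra line in the flat file (or in a second file that is concatenated with the default one) for a given test run.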

djinnome commented 6 years ago

BiGG is great for curated FBA models, but as a metabolite database it sucks, because there are no chemical structures associated with each metabolite, which means that mapping BiGG compounds requires an authoritative source such as MetaNetX. The full BioCyc Tier 1-3 DB collection is freely and openly available to all users without fees or license restrictions.

matthiaskoenig commented 6 years ago

You have ChEBI annotations in BiGG, which directly give you the structures (it is literally one web service call away). In my opinion it is much better to link to a highly curated chemical structure database like ChEBI (which is based on an ontology and is fully open source with open licenses) than to BioCyc.

Also, I am personally very critical of relying on MetaCyc, which is highly restricted via its license agreement and has a track record of moving things behind expensive subscription models. In my personal opinion it is only a matter of time until you have to pay for MetaCyc as well. Building an open infrastructure with resources like MetaCyc and KEGG is just not feasible. They have great content, but their license and subscription models make them a no-go for me.

siddC commented 6 years ago

Agreed. MetaCyc is quite restrictive, and relying on it seems to run counter to memote's goal of creating community-driven, open source software.

opencobra / memote

Choice of Annotation Databases #332