Integrate other databases

jacobwindsor commented 7 years ago

Currently, this only ranks through PubChem's API. It would be nice to use other APIs to rank compounds, and then probably rename this project too. We would have to discuss how other databases are implemented, i.e. simply rank by the total number of "hits" across all databases, or allow filtering of search parameters, who knows. The algorithm probably needs to be a bit more complex to give an accurate indication of the amount of data available for each compound in the dataset.
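To make the simplest option above concrete (ranking by total hits across all databases), here is a rough Python sketch. The function name `rank_by_total_hits`, the data layout and the example counts are all made up for illustration; none of this exists in the project yet.

```python
from typing import Dict, List, Tuple


def rank_by_total_hits(hits: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Rank compounds by the sum of their hit counts across all databases.

    `hits` maps a compound ID to {database name: hit count}. This layout is
    an assumption for illustration, not the project's actual data model.
    """
    totals = {compound: sum(per_db.values()) for compound, per_db in hits.items()}
    # Highest total first; ties are broken by compound ID so the order is stable.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts, purely to show the call.
example = {
    "CID-A": {"pubchem": 12, "metacyc": 3},
    "CID-B": {"pubchem": 40},
    "CID-C": {"pubchem": 5, "metacyc": 10},
}
ranking = rank_by_total_hits(example)
```

A plain sum will favour whichever database returns the largest numbers, so the "more complex" algorithm would probably need some weighting or normalisation per database.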

Other databases (please add):

Google scholar (would have to pay a little bit)
Scifinder (don't think it's possible)
WikiData
Brenda (http://www.brenda-enzymes.org/)
KEGG (although PubChem already uses KEGG)
HMDB
WikiPathways
ChEBI (https://www.ebi.ac.uk/chebi/)

jacobwindsor commented 7 years ago

Some more databases:

And can use this one to get SMILES from IUPAC

jacobwindsor commented 7 years ago

After some preliminary research it seems MetaCyc is the easiest to add since they have a REST API. They even have a nice service to search for foreign keys (e.g. PubChem or KEGG), see here.

However, the only issue is that you have to search on an organism-specific basis. The URL to search is something like:

http://websvc.biocyc.org/[ORGID]/foreignid?ids=[DATABASE-NAME]:[FOREIGNID]

Where ORGID is the organism ID.
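A small sketch of building that URL in Python, following the pattern quoted above. The helper name `foreignid_url` is hypothetical, `HUMAN` as the ORGID for human data is my assumption, and `1234` is just a placeholder ID.

```python
from urllib.parse import quote

BIOCYC_BASE = "http://websvc.biocyc.org"


def foreignid_url(org_id: str, database: str, foreign_id: str) -> str:
    """Build the foreign-ID lookup URL following the pattern quoted above."""
    # Keep the colon between database name and ID unescaped, as in the pattern.
    ids = quote(f"{database}:{foreign_id}", safe=":")
    return f"{BIOCYC_BASE}/{quote(org_id)}/foreignid?ids={ids}"


# "HUMAN" is an assumed ORGID and "1234" is a placeholder PubChem ID.
url = foreignid_url("HUMAN", "PUBCHEM", "1234")
```

Keeping the organism as a parameter means adding other organisms later is only a matter of passing a different ORGID.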

@DeniseSl22 Is it okay to make the ranker only usable for human datasets for now? It should be easy to add other organisms in the future. However, bear in mind that the more databases are added, the harder it will be to keep the organism restriction broad, since some databases may support fewer organisms.

DeniseSl22 commented 7 years ago

Yeah sure. Is PubChem then searched for humans only as well? Perhaps we can add an option in the future where people can say which organism they want to filter on ;)

DeniseSl22 commented 7 years ago

Oh btw, Egon just told me there is a new service (I will get the details by mail) which allows automated searching through articles (for a lot of publishers, not Elsevier). Perhaps we can do something with that as well (I remember you told me that a specific search through literature was really missing when you guys were looking at the VOCs dataset).

DeniseSl22 commented 7 years ago

Here's the info from Egon:

CrossRef API (citation counts): https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md
EuroPubMedCentral API: http://europepmc.org/RestfulWebService#cites
Initiative for Open Citations: https://i4oc.org/

jacobwindsor commented 7 years ago

Hmm cool! CrossRef I guess is the most well known so can integrate that first.
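A first rough idea of what a CrossRef lookup could look like in Python. Note that this counts works matching a free-text query rather than fetching citation counts, which is a simpler stand-in. The `/works` endpoint with `query` and `rows`, and the `total-results` field, are taken from my reading of the REST API docs and should be double-checked; the function names are hypothetical.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_count_url(compound_name: str) -> str:
    """URL for a free-text search of works; rows=0 asks for no items, only metadata."""
    return f"{CROSSREF_WORKS}?{urlencode({'query': compound_name, 'rows': 0})}"


def total_results(response_json: dict) -> int:
    """Pull the number of matching works out of a decoded JSON response.

    Assumes the count lives under message -> total-results.
    """
    return int(response_json["message"]["total-results"])


# Hand-written sample response, not real data.
sample = {"status": "ok", "message": {"total-results": 42, "items": []}}
count = total_results(sample)
search_url = crossref_count_url("acetic acid")
```

Free-text matching on compound names will be noisy (synonyms, common words), so this is only a crude proxy for how much literature exists on a compound.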

jacobwindsor commented 7 years ago

Using MetaCyc, the flow is:

1) Get the MetaCyc ID using the PubChem ID with https://metacyc.org/META/foreignid?ids=PUBCHEM:<id>&fmt=json
2) Retrieve the set of MetaCyc objects concerning that compound with http://websvc.biocyc.org/apixml?fn=[API-FUNCTION]&id=[ORGID]:[OBJECT-ID]&detail=[none|low|full]
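For the second step, a Python sketch of building the `apixml` URL from the pattern above. The helper name `apixml_url` is hypothetical and `SOME-COMPOUND-ID` is a placeholder for whatever ID step 1 returns; parsing the responses is left out because I haven't checked their exact format.

```python
from urllib.parse import urlencode

APIXML_BASE = "http://websvc.biocyc.org/apixml"
DETAIL_LEVELS = ("none", "low", "full")


def apixml_url(function: str, org_id: str, object_id: str, detail: str = "none") -> str:
    """Build an apixml query URL for one API function and one object."""
    if detail not in DETAIL_LEVELS:
        raise ValueError(f"detail must be one of {DETAIL_LEVELS}, got {detail!r}")
    params = {"fn": function, "id": f"{org_id}:{object_id}", "detail": detail}
    # safe=":" keeps the ORGID:OBJECT-ID separator readable in the URL.
    return f"{APIXML_BASE}?{urlencode(params, safe=':')}"


# Placeholder object ID; in practice it would come from the foreignid lookup.
pathways_url = apixml_url("pathways-of-compound", "META", "SOME-COMPOUND-ID")
```

Because the API function is just a string argument, trying out other functions later only needs a different `function` value.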

The second step is what needs to be discussed. What information do we actually want to retrieve from MetaCyc? If you see here, there is quite a lot we can do.

The obvious ones are:

pathways-of-compound
reactions-of-compound

But, there are some others in this list that could be interesting. Potentially, you can go however deep you like - getting the ID required for the next query from the previous query.

all-products-of-gene
binding-site-transcription-factors
chromosome-of-gene
compounds-of-pathway
containers-of
containing-tus
direct-activators
direct-inhibitors
enzymes-of-gene
enzymes-of-pathway
enzymes-of-reaction
genes-of-pathway
genes-of-protein
genes-of-reaction
genes-regulated-by-gene
genes-regulating-gene
modified-containers
modified-forms
monomers-of-protein
pathways-of-compound
pathways-of-gene
reactions-of-compound
reactions-of-enzyme
reactions-of-gene
regulator-proteins-of-transcription-unit
regulon-of-protein
substrates-of-reaction
top-containers
transcription-unit-activators
transcription-unit-binding-sites
transcription-unit-genes
transcription-unit-inhibitors
transcription-unit-mrna-binding-sites
transcription-unit-promoter
transcription-unit-terminators
transcription-unit-transcription-factors
transcription-units-of-gene
transcription-units-of-protein

@egonw and @deniseSl22 could you provide some input?

egonw commented 7 years ago

I would go for the number of pathways and the number of substrates...

DeniseSl22 commented 7 years ago

Hi Jacob,

Just found some info on the ChEBI website that they have an API.... Perhaps useful to add this to the Ranker Program?

https://www.ebi.ac.uk/chebi/libchebi.do

jacobwindsor commented 7 years ago

Oh wow! How did I not see that?

For my reference: here's the API library for Python: https://github.com/libChEBI/libChEBIpy

DeniseSl22 commented 7 years ago

Yeah I am an awesome googler :p

DeniseSl22 commented 7 years ago

Oh and another one I came across (HMDB API): https://github.com/mzmine/mzmine2/issues/195

I think you didn't look at this, because Egon already checked whether the compounds were in HMDB and ChEBI (which a lot of them weren't). So, this could help other people find which compounds they do not have to investigate any further :)

jacobwindsor / pubchem-ranker

Integrate other databases #6