Open jacobwindsor opened 7 years ago
Some more databases:
And can use this one to get SMILES from IUPAC
After some preliminary research it seems MetaCyc is the easiest to add since they have a REST API. They even have a nice service to search for foreign keys (e.g. PubChem or KEGG), see here.
However, the only issue is that you have to search on an organism-specific basis. The URL to search is something like:
http://websvc.biocyc.org/[ORGID]/foreignid?ids=[DATABASE-NAME]:[FOREIGNID]
Where ORGID is the organism ID.
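A rough sketch of building that lookup URL in Python. The function name `foreignid_url` is made up for illustration, and `HUMAN` as the organism ID for human data is an assumption that should be checked against the BioCyc organism list.

```python
from urllib.parse import quote

BIOCYC_BASE = "http://websvc.biocyc.org"


def foreignid_url(org_id, database, foreign_id):
    """Build the foreign-ID lookup URL for one organism database.

    Hypothetical helper: it only assembles the URL pattern quoted above,
    it does not send any request.
    """
    ids = f"{database}:{foreign_id}"
    # Keep ':' unescaped so the DATABASE:ID pair stays readable.
    return f"{BIOCYC_BASE}/{quote(org_id)}/foreignid?ids={quote(ids, safe=':')}"


# "HUMAN" is an assumed ORGID; 962 is just an example PubChem ID.
print(foreignid_url("HUMAN", "PUBCHEM", 962))
```

Keeping the organism ID as a parameter means the human-only restriction stays a default rather than something baked into the code.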
@DeniseSl22 Is it okay to make the ranker only usable for human datasets for now? It should be easy to add other organisms in the future. However, bear in mind that the more databases are added, the harder it will be to keep the organism restriction broad, since some databases may support fewer organisms.
Yeah sure. Is PubChem then searched for humans only as well? Perhaps we can add an option in the future where people can say which organism they want to filter on ;)
Oh btw; Egon just told me there is a new service (I will get the details through mail) which allows automated search through articles (for a lot of publishers, not Elsevier). Perhaps we can do something with that as well (I remembered you told me that a specific search through literature was really missing when you guys were looking at the VOCs dataset)
Here's the info from Egon:
CrossRef API (citation counts): https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md
EuroPubMedCentral API: http://europepmc.org/RestfulWebService#cites
Initiative for Open Citations: https://i4oc.org/
Hmm cool! CrossRef I guess is the most well known, so we can integrate that first.
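A minimal sketch of how a citation count could be read from CrossRef, assuming we already have a DOI for the article. As far as I know the REST API serves a single work at `https://api.crossref.org/works/{doi}` and reports the count in the `is-referenced-by-count` field of `message`. The helper names and the sample payload below are made up, and no request is actually sent here.

```python
import json
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_work_url(doi):
    """Build the URL for a single work; '/' is kept since DOIs contain it."""
    return f"{CROSSREF_WORKS}/{quote(doi, safe='/')}"


def citation_count(payload):
    """Read the citation count from a parsed works response.

    Returns 0 when the field is missing, so a failed lookup does not
    break the ranking.
    """
    return payload.get("message", {}).get("is-referenced-by-count", 0)


# Made-up payload in the shape of a works response, for illustration only.
sample = json.loads(
    '{"status": "ok", "message": {"DOI": "10.1000/example", '
    '"is-referenced-by-count": 12}}'
)
print(crossref_work_url("10.1000/example"))
print(citation_count(sample))
```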
Using MetaCyc, the flow is:
1) Get the MetaCyc ID using the PubChem ID with https://metacyc.org/META/foreignid?ids=PUBCHEM:<id>&fmt=json
2) Retrieve the set of MetaCyc objects concerning that compound with http://websvc.biocyc.org/apixml?fn=[API-FUNCTION]&id=[ORGID]:[OBJECT-ID]&detail=[none|low|full]
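The two steps could be sketched like this in Python, building the URLs only. The helper names are hypothetical, `pathways-of-compound` is assumed to be one of the available API functions, and `WATER` is only an example object ID, so both should be checked against the BioCyc web services docs before relying on them.

```python
from urllib.parse import urlencode

ALLOWED_DETAIL = ("none", "low", "full")


def metacyc_lookup_url(pubchem_id):
    """Step 1: URL that maps a PubChem ID to a MetaCyc object ID (JSON)."""
    return f"https://metacyc.org/META/foreignid?ids=PUBCHEM:{pubchem_id}&fmt=json"


def apixml_url(api_function, org_id, object_id, detail="low"):
    """Step 2: URL that runs an API function on a MetaCyc object."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {ALLOWED_DETAIL}")
    query = urlencode(
        {"fn": api_function, "id": f"{org_id}:{object_id}", "detail": detail},
        safe=":",  # keep the ORGID:OBJECT-ID separator unescaped
    )
    return f"http://websvc.biocyc.org/apixml?{query}"


print(metacyc_lookup_url(962))
# Function name and object ID below are assumptions for illustration.
print(apixml_url("pathways-of-compound", "META", "WATER"))
```

Going deeper, as in chaining queries, would mean feeding an ID parsed from one response into another `apixml_url` call.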
The second step is what needs to be discussed. What information do we actually want to retrieve from MetaCyc? If you see here, there is quite a lot we can do.
The obvious ones are:
But, there are some others in this list that could be interesting. Potentially, you can go however deep you like - getting the ID required for the next query from the previous query.
@egonw and @deniseSl22 could you provide some input?
I would go for the number of pathways and the number of substrates...
Hi Jacob,
Just found some info on the ChEBI website that they have an API.... Perhaps useful to add this to the Ranker Program?
Oh wow! How did I not see that?
For my reference: here's the API library for Python: https://github.com/libChEBI/libChEBIpy
Yeah, I am an awesome googler :p
Oh and another one I came across (HMDB API): https://github.com/mzmine/mzmine2/issues/195
I think you didn't look at this, because Egon already checked if the compounds were in HMDB and ChEBI (which a lot of them weren't). So, this could help other people to find which compounds they do not have to investigate any further :)
Currently, this only ranks through PubChem's API. It would be nice to use other APIs to rank compounds. Then probably rename this project too. We would have to discuss how other databases are implemented, i.e. simply rank by the total number of "hits" across all databases, or allow filtering of search parameters, who knows. Probably the algorithm needs to be a bit more complex to get an accurate indication of the amount of data available for each compound in the dataset.
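To make the "total hits across all databases" idea concrete, here is a small sketch. It is not how the ranker currently works: the function name, the per-database weights and the hit counts are all invented for illustration.

```python
def rank_compounds(hits, weights=None):
    """Rank compounds by their (optionally weighted) total hit count.

    hits:    {compound: {database: hit_count}}
    weights: {database: weight}; databases not listed get weight 1,
             and a weight of 0 filters a database out entirely.
    Returns a list of (compound, score) pairs, highest score first.
    """
    weights = weights or {}

    def score(per_db):
        return sum(weights.get(db, 1) * n for db, n in per_db.items())

    scored = [(compound, score(per_db)) for compound, per_db in hits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented hit counts, only to show the shape of the input.
hits = {
    "acetone": {"pubchem": 120, "metacyc": 4},
    "isoprene": {"pubchem": 80, "metacyc": 10, "chebi": 3},
    "unknown-voc": {"pubchem": 2},
}
print(rank_compounds(hits))
print(rank_compounds(hits, weights={"metacyc": 10}))
```

A plain sum lets the database with the largest counts dominate, which is one reason the weights (or some normalisation per database) would probably be needed.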
Other databases (please add):