petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Put 583 remaining compounds in Wikidata #5

Closed egonw closed 4 years ago

egonw commented 4 years ago

By creating a Bacting script that takes PubChem CIDs and adds the corresponding compounds to Wikidata.
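As a rough illustration of the lookup half of that idea (this is not the Bacting script, which would be written in Groovy), the Python sketch below builds a PubChem PUG REST request for a batch of CIDs and parses the CSV that comes back. The function names and the choice of properties are assumptions made for this example; SMILES properties can be requested the same way, but the exact property names should be checked against the current PUG REST documentation.

```python
import csv
import io
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties=("MolecularFormula", "InChIKey")):
    """Build a PUG REST URL asking for the given properties as CSV."""
    # int() rejects anything that is not a plain numeric CID.
    cid_list = ",".join(str(int(cid)) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/property/{','.join(properties)}/CSV"


def parse_property_csv(text):
    """Turn the CSV response into a dict keyed by CID (as a string)."""
    return {row["CID"]: row for row in csv.DictReader(io.StringIO(text))}


def fetch_properties(cids):
    """Hypothetical helper: performs the network call and parses the result."""
    with urlopen(property_url(cids)) as response:
        return parse_property_csv(response.read().decode("utf-8"))
```

Keeping URL construction and parsing separate from the network call makes the two pure functions easy to check offline.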

egonw commented 4 years ago

@petermr, where can I find the list of 200 CIDs?

petermr commented 4 years ago

on https://github.com/gilienv/EssOilDB/tree/master/tables/chemistry/

It's a bit messy, as we have split/forked the chemistry, and the disambiguation and cleaning is going on there. Ambarish Kumar is doing a good job, but he's not working on CEV. See https://github.com/gilienv/EssOilDB/issues/76, which has 100 comments, and look at the latest. It may be easiest just to download the tables from /tables/chemistry/. I have posted an HTML table today.

Do you need access? I'll post something.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

egonw commented 4 years ago

I need to park this until at least next week, I'm afraid. I have some urgent stuff to solve first :(

petermr commented 4 years ago

There's no rush

egonw commented 4 years ago

Sorry, there are too many files in that folder... I have no idea at this moment how to see which compounds have not been found in Wikidata yet (and that I should add).

Help/suggestions welcome.
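One way to answer that question, sketched here in Python under stated assumptions, is to ask the Wikidata Query Service which PubChem CIDs already have an item (property P662, "PubChem CID") and treat the rest as missing. The function names are hypothetical and this is not the approach confirmed anywhere in this thread; the parsing assumes the standard SPARQL JSON results format.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS = "https://query.wikidata.org/sparql"


def cid_query(cids):
    """SPARQL query returning the Wikidata items that carry any of the CIDs."""
    values = " ".join(f'"{int(cid)}"' for cid in cids)
    return (
        "SELECT ?cid ?item WHERE { "
        f"VALUES ?cid {{ {values} }} "
        "?item wdt:P662 ?cid . }"
    )


def found_cids(sparql_json):
    """Collect the CIDs that appear in a SPARQL JSON result."""
    return {b["cid"]["value"] for b in sparql_json["results"]["bindings"]}


def missing_cids(cids, sparql_json):
    """CIDs from the input list that have no Wikidata item, in input order."""
    present = found_cids(sparql_json)
    return [str(int(c)) for c in cids if str(int(c)) not in present]


def run_query(cids):
    """Hypothetical helper: sends the query to the public endpoint."""
    url = WDQS + "?" + urlencode({"query": cid_query(cids), "format": "json"})
    request = Request(url, headers={"User-Agent": "cid-check-sketch/0.1 (example)"})
    with urlopen(request) as response:
        return json.load(response)
```

For a long list it is probably kinder to the endpoint to send the CIDs in modest batches rather than in one large `VALUES` clause.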

petermr commented 4 years ago

We have copied the 2112 EssoilDB compounds to CEVOpen. Ambarish Kumar (ambari73_sit@jnu.ac.in) is working on them. We found far too many synonyms in PubChem and ChEBI, so we've reduced those to about 300 that were found in EssoilDB 1.0.

Ambarish, do we have a simple list of compounds in CEVOpen that do not have Wikidata entries?

petermr commented 4 years ago

Yes sir.

Please check the list of compounds that do not have a Wikidata ID.

notFoundWikidata.csv - https://github.com/petermr/CEVOpen/blob/master/notFoundWikidata.csv

Total number of records: 583.

-- AMBARISH KUMAR

M.Tech - 2014-2016 SC&IS ; Jawaharlal Nehru University, New Delhi, INDIA

+91 - 8377964303 ambari73_sit@jnu.ac.in er.ambarish@gmail.com

petermr commented 4 years ago

Thanks, Ambarish. Egon, does this look tractable?

(I suspect that some of these, especially the esters, which may have missing spaces, are not what the authors intended. That doesn't alter the validity of linking Wikidata to PubChem; it just means they may not get used frequently.)

egonw commented 4 years ago

Yes, thanks!

egonw commented 4 years ago

Processing the file...

egonw commented 4 years ago

This looks promising :)

====================
C₁₇H₂₈O₂ is not yet in Wikidata
====================
====================
C₁₅H₂₀O is not yet in Wikidata
====================
====================
C₁₇H₂₄O₂ is not yet in Wikidata
====================
====================
C₁₆H₂₂O₂ is not yet in Wikidata
====================
====================
C₁₀H₁₆O is not yet in Wikidata
====================
====================
C₉H₁₆ is not yet in Wikidata
====================
====================
C₁₅H₂₆O is not yet in Wikidata
====================
====================
C₁₅H₂₂O is not yet in Wikidata
====================
====================
C₁₅H₂₄O is not yet in Wikidata
====================
====================
C₁₅H₂₆O is not yet in Wikidata
====================
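The log above prints molecular formulas with Unicode subscript digits, whereas PubChem returns them in plain ASCII (for example `C17H28O2`). If a conversion between the two forms is needed, a minimal Python sketch could look like this; the helper names are made up for the example, and it is an assumption that the real script converts formulas this way.

```python
# Translation tables between ASCII digits and Unicode subscript digits.
TO_SUBSCRIPT = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
TO_ASCII = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def to_subscript(formula):
    """Rewrite every digit in a formula as a subscript digit."""
    return formula.translate(TO_SUBSCRIPT)


def to_ascii(formula):
    """Inverse of to_subscript, e.g. for comparing against PubChem output."""
    return formula.translate(TO_ASCII)


print(to_subscript("C17H28O2"))
```

This simple version converts every digit, so it is only suitable for neutral formulas without charges or isotope labels.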
egonw commented 4 years ago

And so does the next step! :) The first 10 missing entries are in (573 to go ;)

egonw commented 4 years ago

For the next batch, I do find hits in Wikidata, tho. But that's not a problem.

egonw commented 4 years ago

Okay, this is the workflow. On the CSV file linked above, I run this script: https://github.com/egonw/ons-wikidata/blob/master/EssOil/prepareInput.groovy

This prepares the input for the following script, which I run after that: https://github.com/egonw/ons-wikidata/blob/master/Wikidata/createWDitemsFromSMILES.groovy

The first (new) script fetches the SMILES for the compounds from PubChem.
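To give a feel for the last step of such a workflow, here is a hedged Python sketch that turns one compound record into QuickStatements (version 1, tab-separated) commands for a new item. It is not a port of createWDitemsFromSMILES.groovy: the function name, the input dictionary keys, and the choice of statements (P31 "instance of" Q11173 "chemical compound", P233 "canonical SMILES", P662 "PubChem CID") are assumptions made for illustration.

```python
def quickstatements_for(compound):
    """Return QuickStatements V1 commands that create one compound item.

    `compound` is a hypothetical dict with the keys "name", "smiles", "cid".
    """
    lines = [
        "CREATE",                                # start a new item
        f'LAST\tLen\t"{compound["name"]}"',      # English label
        "LAST\tP31\tQ11173",                     # instance of: chemical compound
        f'LAST\tP233\t"{compound["smiles"]}"',   # canonical SMILES
        f'LAST\tP662\t"{compound["cid"]}"',      # PubChem CID
    ]
    return "\n".join(lines)


print(quickstatements_for(
    {"name": "example compound", "smiles": "CCO", "cid": "702"}
))
```

A real script would also need to check for existing items first and to handle names or SMILES strings containing characters that need escaping, which this sketch does not attempt.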

egonw commented 4 years ago

I'm now doing the remaining batch: https://tools.wmflabs.org/quickstatements/#/batch/18772

petermr commented 4 years ago

Egon, this is great. We are writing a paper for Mat Todd, and it would be great to put all this in.

egonw commented 4 years ago

Hi all, so what is next for this issue?