Disambiguating chemistry and fixing typos

petermr commented 5 years ago

Chemical nomenclature is complex and ambiguous. Any attempt to disambiguate MUST record ambiguity. Thus acetyl-furan could be 2-acetyl-furan or 3-acetyl-furan; OPSIN (https://opsin.ch.cam.ac.uk) gives:

APPEARS_AMBIGUOUS: Connection of acet to furan

and this must be recorded

Always test with OPSIN.

petermr commented 4 years ago

On Tue, Aug 20, 2019 at 9:23 AM Ambarish Kumar notifications@github.com wrote:

Sir, Removed all comp+comp and comp/comp mixtures.

Good

The reason I find for PubChem returning NA for the majority of compound names (used as query input) is that these synonyms have not been registered in PubChem by depositors.

Resolving all remaining names using PubChem REST API.

Please go through the first 208 findings as first batch job - sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/nameCleaningPubChemLookup.csv .

HAVE YOU RESOLVED ALL THE OTHERS? are there only 208 you cannot resolve?

70 entries are available in PubChem and the rest are not searchable.

Examples of not-retrieved entries are as follows. PubChem_lookup is generated after truncating the names. Is this the right way to get PubChem_lookup and retrieve compound_cid?

ID not-retrieved entries PubChem_lookup Compound_CID

C898 (E)-3-hexanoic acid HEXANOIC ACID 8892

This name is corrupt

C5 (E)-2,2-decenal not found NA

corrupt

C4 (E)-2,(Z)-6-decadienal 2,6-Decadienal 5283350

corrupt

C893 (E)-2-undecenol Undecenol 22506525

Why are you removing the "(E)-2-" string

C891 (E)-2-undecanal UNDECANAL 8186

corrupt

C799 (2)-3-hexenylacetate Cis-3-Hexenyl Acetate 5363388

corrupt

C800 (2)-3-hexenylbenzoate Cis-3-HEXENYLBENZOATE 32809

corrupt

Please summarise clearly exactly how many compounds you started with, how many you discarded, and why you discarded them. A flow diagram is very useful here.

Here's an example for papers, You can do the same for compounds https://www.researchgate.net/figure/Bibliography-search-PRISMA-This-figure-represents-the-methodology-applied-to-screen_fig1_327298098

Other aspects: DO NOT translate alpha to .alpha. I have already mentioned this.

Your "clean" column should only cover capitalization, dashes, whitespace and replacement of corrupt unicode characters. NEVER try to translate chemistry. That must be left to chemical experts.
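A minimal sketch of what such a syntax-only cleaning step might look like in shell. The function name `clean_name` and the choice of characters to normalize (en dash, minus sign, runs of whitespace) are assumptions for illustration, not the project's actual script; capitalization is left alone because the thread does not say which convention to use.

```shell
#!/bin/sh
# Hypothetical helper: syntax-only cleaning of a compound name.
# It never touches the chemistry (locants, stereo descriptors, Greek letters).

EN_DASH=$(printf '\342\200\223')  # U+2013 EN DASH as UTF-8 bytes
MINUS=$(printf '\342\210\222')    # U+2212 MINUS SIGN as UTF-8 bytes

clean_name() {
  # 1. replace typographic dashes with an ASCII hyphen
  # 2. collapse runs of whitespace to one space
  # 3. trim a leading or trailing space
  printf '%s\n' "$1" |
    sed -e "s/$EN_DASH/-/g" -e "s/$MINUS/-/g" \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' -e 's/ $//'
}

# Example call: clean_name '  (E)-2-undecenol  '
```

Note that the "(E)-2-" prefix passes through unchanged; only characters are normalized.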

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 4 years ago

Sir, I have reverted all the changes made to the -alpha-, -beta-, -gamma- and -delta- notations. All other changes are as stated.

 C891       (E)-2-undecanal                   UNDECANAL                     8186

The cited example shows a search entry where the isomeric or other nomenclature notation (the stereoisomeric (E) or (Z) descriptor, or quite often the functional-group position) had to be truncated to get a PubChem search result.

For the first batch job, I started with 220 compounds with a clean_name. 80 of them return search results under their available or reported synonyms, and from those I am selecting the best match. The flow diagram is below. All records are in the table. Sir, none are discarded.

dignamecleaning

petermr commented 4 years ago

This cannot be right.

Is the 220 a sample?? where does it come from?? The top box should be ca. 7500

The bottom line is meaningless. The boxes have the same labels. And they don't sum to the box above.

You must be more precise with your terminology. At present it's meaningless.

On Tue, Aug 20, 2019 at 11:04 AM Ambarish Kumar notifications@github.com wrote:

[image: dignamecleaning] https://user-images.githubusercontent.com/36997739/63338196-fb7b9300-c35f-11e9-8b90-44026c4f81b9.jpg

ambarishK commented 4 years ago

Sir, This is the flow diagram with corrected levels.

newflowdig

The 220 are the compound list for the first batch job submitted to the PubChem REST API. They are the first 220 compounds from the list of compounds which did not generate CIDs from the PubChem Identifier Exchange Service.

petermr commented 4 years ago

On Tue, Aug 20, 2019 at 11:58 AM Ambarish Kumar notifications@github.com wrote:

[image: newflowdig] https://user-images.githubusercontent.com/36997739/63341730-801ddf80-c367-11e9-8c85-0452d1e86d46.jpg

Label the files (compound20190816.tsv - 7171 entries (or is it 7170?)). After excluding 252 from 7171 you should have a labelled box with 7171 - 252 = 6919 compounds. And so on. Detail each step:

remove non-chemical names (I see you have "remove" as a name and there are several others)
syntax correction (delete some whitespace, normalize characters). This probably doesn't affect the count
then submit to PubChem and give the counts of the entries that passed and failed. At each stage they should add up to the previous count.

Hopefully you then have a list of compounds that passed (compound20190822_resolvedPC.tsv) and some that failed. Create a table of those that failed. If it's larger than 500 then there is a problem. Put it in a separate file with a dated label (compound20190822_unresolvedPC.tsv).
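The "each stage must add up to the previous one" rule can be checked mechanically before drawing the diagram. The following is a sketch only; `check_step` is a hypothetical helper name, and the numbers in the example call are the ones quoted above.

```shell
#!/bin/sh
# check_step LABEL PARENT CHILD...
# Succeeds only if the child boxes sum to the parent box.
check_step() {
  label=$1; parent=$2; shift 2
  sum=0
  for n in "$@"; do sum=$((sum + n)); done
  if [ "$sum" -eq "$parent" ]; then
    printf '%s: OK (%d)\n' "$label" "$parent"
  else
    printf '%s: MISMATCH (boxes sum to %d, expected %d)\n' "$label" "$sum" "$parent"
    return 1
  fi
}

# Example call: check_step 'remove non-chemical names' 7171 252 6919
```

Running one such check per branch of the flow diagram would catch boxes that "don't sum to the box above" before the diagram is posted.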

ambarishK commented 4 years ago

Sir, I have restructured the flow diagram and added a combined sheet as well as separate sheets for the found and not-found search entries from the PubChem Identifier Exchange Service.

flowDIG

petermr commented 4 years ago

Thank you. I still don't understand why there are 4200 compounds that can't be resolved. I've spotted a few obvious problems:

some still contain "+", i.e. a mixture
some contain "unknown"; remove these in the first pass

There are also a LOT of names that are corrupt or underspecified.

In fact it looks like more than half the names in EssoilDB are corrupt or underspecified.

This does not mean that half the compounds in the profiles are underspecified/corrupt, as there will be many compounds which occur many times. Hmm... I think we have to discard the 4200 names (we can keep the profiles that they occur in, but we cannot give a chemical formula to some of the components). If they only occur once then they are a small percentage (4200/142,000 = 3%).

There is no doubt that machine extraction of the literature will be more reliable. This was harder than I thought. For some reason there are fewer corrupt plant names.

This is then our draft version of the chemical table - 2971 names. Now we should decide how many are unique, e.g.

C170   3-hydroxy-2-butanone   3-hydroxy-2-butanone   179
C2171  3-hydroxybutan-2-one   3-hydroxybutan-2-one   179
C2779  acetoin                acetoin                179

are all the same compound.

P.

Sir,

Restructured the flow-diagram and add combined https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/originalNamePlusCleanName.csv as well as separate sheet for found https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/foundOriginalNamePlusCleanName20190821.csv and not found https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/notFoundOriginalNamePlusCleanName20190821.csv search entries using PubChem identifiers exchange services.

Where is the flow diagram?

ambarishK commented 4 years ago

Yes sir, all three are same compounds - CID - 179.

flowDig.

petermr commented 4 years ago

The diagram needs more refinement. Diverging lines represent separations. Also try to make names shorter than 15 characters. I'd suggest:

raw20190816.tsv (7170)
        |
   clean syntax
        |
clean20190816.tsv
        |
  PubChem lookup
      /     \
resolved20190816.tsv    unresolved20190816.tsv
      (2974)                  (4196)

I don't think there is any point in removing mixtures, as PubChem won't resolve them and there is so much corruption it's not worth trying to do more.

Then maybe:

resolved20190820.tsv (2974)
        |
  remove synonyms
        |
uniqueCompounds20190821.tsv

where there is only one CID (or InChI).

We also need to extract the InChIs.

On Wed, Aug 21, 2019 at 5:25 PM Ambarish Kumar notifications@github.com wrote:

flowDig https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/flowDIG.jpg .

petermr commented 4 years ago

I have removed duplicate CIDs and this gives 2112 rows so uniqueCompounds could be the final table and resolved will contain synonyms. The actual name in unique will be arbitrary at this stage.
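For reference, removing duplicate CIDs from a tab-separated table can be sketched in one awk call. This is an illustration rather than the command actually used; it assumes the CID sits in the fourth column (ID, original name, clean name, CID, as in the acetoin example earlier in the thread), and `unique_by_cid` is a hypothetical name.

```shell
#!/bin/sh
# unique_by_cid FILE
# Keep the first row seen for each CID (4th tab-separated column).
# Which synonym survives is arbitrary: it is simply the first one in the file.
unique_by_cid() {
  awk -F '\t' '!seen[$4]++' "$1"
}

# Example call: unique_by_cid resolved20190820.tsv > uniqueCompounds20190821.tsv
```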

@mannyrules this shows that we need to check compounds for uniqueness when they are ingested. P.

petermr commented 4 years ago

I think the final chemical/compound step is to link the good unique compounds (not names) (ca 2112) back to the profile records. An immediate task is to find the distribution of compounds - which are the commonest? That will be really useful for the next steps as we will have "most" of the compounds in essential oils.

In principle this table and the plant table can then be used to validate new input. This would mean that any name which wasn't in the synonym table could be checked when it was ingested - always the best time.

ambarishK commented 4 years ago

Sir,

newFlowDiagram.

newFlowDiagram

raw20190822.tsv clean20190822.tsv resolved20190822.tsv unresolved20190822.tsv uniqueCompounds20190822.tsv

newFlowDiagram - with uniform date-stamp to each file.

newFlowDiagram_1

petermr commented 4 years ago

I have done an experiment of looking up names via the PubChem API.

The compounds are in table 1 of PMC5248495 (I haven't committed those papers but I will)

Here's a typical script:

#! /bin/sh

echo 1-octen-3-ol
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/1-octen-3-ol/cids/XML
echo 1-8-Cineole
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/1-8-Cineole/cids/XML
echo %28Z%29-beta-Ocimene
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%28Z%29-beta-Ocimene/cids/XML
echo gamma-Terpinene
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/gamma-Terpinene/cids/XML

Note that the names must be escaped as follows:

all Greek characters spelt out
all punctuation URL-escaped (space => %20, "(" => %28, etc.)
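The escaping rules above can be applied automatically rather than by hand. Below is a sketch in POSIX shell; `urlencode` and `pug_url` are hypothetical helper names, and the function assumes ASCII input (Greek letters already spelt out, as required above). The URL pattern is the one used in the script above.

```shell
#!/bin/sh
# urlencode STRING
# Percent-encode everything except letters, digits and - . _ ~
# Assumes ASCII input. Uses global scratch variables (POSIX sh has no "local").
urlencode() {
  _ue_s=$1; _ue_out=
  while [ -n "$_ue_s" ]; do
    _ue_rest=${_ue_s#?}               # string without its first character
    _ue_c=${_ue_s%"$_ue_rest"}        # the first character itself
    case $_ue_c in
      [A-Za-z0-9._~-]) _ue_out=$_ue_out$_ue_c ;;
      *) _ue_out=$_ue_out$(printf '%%%02X' "'$_ue_c") ;;
    esac
    _ue_s=$_ue_rest
  done
  printf '%s\n' "$_ue_out"
}

# pug_url NAME: build the name-to-CID lookup URL for an escaped name
pug_url() {
  printf 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%s/cids/XML\n' \
    "$(urlencode "$1")"
}

# Example call: curl "$(pug_url '(Z)-beta-Ocimene')"
```

Encoding every non-alphanumeric character except `- . _ ~` is stricter than strictly necessary, but it avoids having to decide case by case which punctuation is safe in a URL path.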

The results may have zero, one or many CIDs retrieved. Here is a typical result (I have massaged the output to make it readable, and I add comments):

<compounds>
<Compound name="1-octen-3-ol">
  <CID id="18827"/> <!-- single CID no problem -->
</Compound>
<Fault name="1-8-Cineole"/> <!-- not found because it should be `1,8` -->
<Compound name="%28Z%29-beta-Ocimene"> <!-- note escaping -->
  <CID id="5320250"/>
</Compound>
<Compound name="gamma-Terpinene">
  <CID id="7461"/>
</Compound>
<Fault name="Fenhone"/> <!-- misspelt in article -->
<Compound name="Linalool">
  <CID id="6549"/>
</Compound>
<Compound name="Camphor">
  <CID id="2537"/>
</Compound>
<Compound name="alpha-Terpineol">
  <CID id="17100"/>
</Compound>
<Compound name="Methyl%20chavicol"> <!-- space escaped -->
  <CID id="8815"/>
</Compound>
<Compound name="Nerol">
  <CID id="643820"/>
</Compound>
<Compound name="Neral">
  <CID id="643779"/>
</Compound>
<Compound name="Geraniol">
  <CID id="637566"/>
</Compound>
<Compound name="Geranial">
  <CID id="638011"/>
</Compound>
<Compound name="Bornyl%20acetate"> <!-- many synonyms, I think because of substituents -->
  <CID id="6448"/>
  <CID id="93009"/>
  <CID id="442460"/>
  <CID id="6950274"/>
  <CID id="443131"/>
  <CID id="3034424"/>
  <CID id="12097317"/>
  <CID id="44630108"/>
  <CID id="57505377"/>
</Compound>
<Compound name="Neryl%20acetate">
  <CID id="1549025"/>
</Compound>
<Compound name="Methyl%20cinnamate">
  <CID id="637520"/>
</Compound>
<Compound name="beta-Elemene">
  <CID id="6918391"/>
</Compound>
<Compound name="beta-Caryophyllene">
  <CID id="5281515"/>
</Compound>
<Fault name="beta–Copaene"/> <!-- I don't know why this isn't found -->
<Compound name="trans-alpha-Bergamotene">
  <CID id="6429302"/>
</Compound>
<Compound name="alpha-Humulene">
  <CID id="5281520"/>
</Compound>
<Compound name="cis-beta-Farnesene">
  <CID id="5317319"/>
</Compound>
<Compound name="Germacrene%20d">
  <CID id="5317570"/>
  <CID id="5373727"/>
  <CID id="6436582"/>
  <CID id="91104"/>
  <CID id="49796490"/>
  <CID id="91723653"/>
</Compound>
<Compound name="beta-Cubebene">
  <CID id="93081"/>
</Compound>
<Compound name="alpha-Bulnesene">
  <CID id="94275"/>
</Compound>
<Fault name="alpha-Amorphen"/> <!-- misspelt -->
<Compound name="delta-Cadinene">
  <CID id="441005"/>
</Compound>
<Compound name="Aromadendrene">
  <CID id="91354"/>
  <CID id="11095734"/>
  <CID id="12305243"/>
  <CID id="91746456"/>
</Compound>
<Compound name="Spathulenol">
  <CID id="92231"/>
  <CID id="522266"/>
  <CID id="6432640"/>
  <CID id="97032059"/>
  <CID id="13854255"/>
</Compound>
<Compound name="Caryophyllene%20oxide">
  <CID id="1742210"/>
</Compound>
<Fault name="alpha–Bisabolene"/> <!-- don't know why -->
<Fault name="beta-Bisabolenene"/> <!-- typo -->
<Compound name="alpha-Bisabolol">
  <CID id="442343"/>
  <CID id="1549992"/>
  <CID id="1201551"/>
  <CID id="10586"/>
  <CID id="6506009"/>
</Compound>
</compounds>
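One observation on the two unexplained faults: as written in this thread, "beta–Copaene" and "alpha–Bisabolene" contain an en dash (U+2013) rather than an ASCII hyphen, which may be why PubChem did not find them. A quick way to flag such names before lookup is sketched below; `has_non_ascii` is a hypothetical helper, and it flags any byte outside printable ASCII (so tabs are flagged too).

```shell
#!/bin/sh
# has_non_ascii STRING
# Exit status 0 if STRING contains any byte outside the printable ASCII range
# (space through tilde); exit status 1 otherwise.
has_non_ascii() {
  printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'
}

# Example use:
#   if has_non_ascii "$name"; then echo "check characters in: $name"; fi
```

Names flagged this way still need a human decision on the replacement character; the check only points at where to look.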

ambarishK commented 4 years ago

Let me go through it sir.

petermr commented 4 years ago

Your flow diagram is starting to look good. I would colour the operations (e.g. clean, lookup) with a different colour and also embed them in the line. So move "clean" and "Remove" so the line passes through them.

How are you doing the lookup? In the same way as me?

P.

ambarishK commented 4 years ago

Sir, I am using the web API for name lookup and to check a name's availability in PubChem. For CID retrieval I use the PubChem Identifier Exchange Service.

petermr commented 4 years ago

Please give the URLs and examples of the RESTful APIs.

ambarishK commented 4 years ago

https://pubchem.ncbi.nlm.nih.gov/ - used for name lookup and for checking whether a compound name is available in the repository.

https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi - PubChem Identifier Exchange Service.

PUG REST documentation

Example for the PubChem Identifier Exchange Service - PubChem services [image: PubChemidenfiersexchangeservices]

For batch retrieval, browse for the .csv file containing the list of compound names.

I found it easier than the PUG REST API, as it does not require replacing white-space or parentheses with escapes such as %20, %28 or %29.

Both services perform equally well: I passed the unresolved compound names to both of them (after escaping white-space and parentheses for the PUG REST API) and the results are the same.

For example:

C4995  iso-borneol
    PubChem identifier exchange services: Result set is empty
    PUG REST API: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/iso-borneol/cids/xml
                  <Message>No CID found</Message>

C5044  isobonyl acetate
    PubChem identifier exchange services: Result set is empty
    PUG REST API: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/isobonyl%20acetate/cids/xml
                  <Message>No CID found</Message>

C828   (4Z)-decenal
    PubChem identifier exchange services: Result set is empty
    PUG REST API: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%284Z%29-decenal/cids/xml
                  <Message>No CID found</Message>

Sir, after syntax correction in the next pass, only 35 unique compound names were resolved (i.e. generated a CID):

(Z)-9-octadecen-1-ol    5284499
cadinol 6428423
caryophyllene   5281515
cymene  7463
gurjunene   15560275
muurolene   12306047
phellandrene    7460
thujene 520384
1-epi-cubenol   519857
alpha-bergamotene   86608
alpha-calacorene    12302243
alpha-copaene   70678558
alpha-selinene  10856614
beta-bisabolene 10104370
beta-bourbonene 62566
beta-selinene   442393
carotol 442347
caryophyllene oxide 1742210
cis-alpha-bisabolene    5352653
cryptomerione   11964091
cubenol 11770062
dehydroaromadendrene    91746711
delta-cadinene  441005
isospathulenol  14038848
mustakone   12313013
spathulenol 92231
trans-caryophyllene 5281515
alpha-acoradiene    90351
beta-Copaene    57339298
E-beta-ocimene  5281553
epi-alpha-muurolol  3084331
Germacrene-B    15559495
Germacrene-D    91723653
selin-11-en-4alpha-ol   15560330
trans-alpha-bergamotene 6429302

Now trying curl for the remaining unresolved compound names. I am writing a script for that.

petermr commented 4 years ago

Thank you, This is clear. I didn't realise you were using the identifier exchange and that it managed multiple names/lookup. Good. Where did these compounds come from? We need to standardise on test sets...

But I think we need to make this more systematic. At present what I'd like to do is come up with a resolved list of compounds (e.g. the 2112 set) AND their counts in EssoilDB. Can you do this?

ambarishK commented 4 years ago

Sir, these compounds are from the unresolved20190822.tsv.

Yes sir, I can get the count of each entry of the resolved list of compounds in EssoilDB.

ambarishK commented 4 years ago

Sir, please go through the frequency count of each resolved compound in EssoilDB - sheet

Column description -

Compound_identifier - unique identifier assigned to each compound.
original_name - original name of the compound in the database.
clean_name - cleaned name of the compound.
cid - compound CID.
freq - frequency count of the compound in EssoilDB.
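A frequency column of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the script that produced the sheet: it assumes a two-column name table (clean_name, cid) and a two-column profile table (profile_id, clean_name), both tab-separated, and `cid_freq` is a hypothetical name. Counting per CID rather than per name means synonyms are pooled.

```shell
#!/bin/sh
# cid_freq NAMES_TSV PROFILES_TSV
# NAMES_TSV:    clean_name <TAB> cid          (hypothetical layout)
# PROFILES_TSV: profile_id <TAB> clean_name   (hypothetical layout)
# Prints "cid <TAB> count", most frequent first; unresolved names are skipped.
cid_freq() {
  awk -F '\t' '
    NR == FNR   { cid[$1] = $2; next }      # first file: name -> CID map
    ($2 in cid) { n[cid[$2]]++ }            # second file: count per CID
    END         { for (c in n) printf "%s\t%d\n", c, n[c] }
  ' "$1" "$2" | sort -t "$(printf '\t')" -k2,2nr -k1,1n
}

# Example call: cid_freq names.tsv profiles.tsv > freq.tsv
```

Because the join is on the exact clean_name string, any encoding difference between the two files will silently produce a count of zero for that compound.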

petermr commented 4 years ago

Thanks, This looks very useful

On Fri, Aug 23, 2019 at 12:12 PM Ambarish Kumar notifications@github.com wrote:

Sir, go through the frequency count of each resolved compound into EssoilDB - sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompFreqCount.tsv

ambarishK commented 4 years ago

Welcome sir.

petermr commented 4 years ago

The next steps are to get

InChI
wikidata into the same table.

petermr commented 4 years ago

There are 17 compounds that have zero (0) count. I am not surprised - it's a problem of the lookup and could be caused by small lexical problems (e.g. non-unicode or spaces). I am not worried.

ambarishK commented 4 years ago

Yes sir.

ambarishK commented 4 years ago

Running a script for ami-dictionary to get the WIKIDATA ID. Also rectifying the '0' counts for compounds in EssoilDB (which occurred due to UTF-8 encoding).

petermr commented 4 years ago

OK, You might also wish to use RDF lookup for the unique set of PubChem identifiers. Less fragile than names.

On Sat, Aug 24, 2019 at 3:57 PM Ambarish Kumar notifications@github.com wrote:

Running a script https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/runWDID.sh for ami-dictionary to get WIKIDATA ID. Also, rectifying '0' count for compounds into the EssoilDB (which occurred due to UTF-8 encoding).

petermr commented 4 years ago

And it may be a good idea to batch this into chunks of (say) 200 compounds in case you need to restart.
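Batching a name list into restartable chunks can be done with the standard `split` utility. A minimal sketch follows; `chunk_names` and the `chunk_` file prefix are hypothetical choices.

```shell
#!/bin/sh
# chunk_names FILE SIZE OUTDIR
# Write FILE in pieces of SIZE lines to OUTDIR/chunk_aa, OUTDIR/chunk_ab, ...
chunk_names() {
  mkdir -p "$3" && split -l "$2" "$1" "$3/chunk_"
}

# Example call: chunk_names names.txt 200 chunks
# Each chunk can then be processed in its own run, so a failure only
# requires rerunning that one chunk.
```

As a size check: a list of 2114 names gives 11 chunks at 200 names each, or 22 chunks at 100 names each (21 full chunks plus one of 14).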

ambarishK commented 4 years ago

Sir, the script is running on the lab desktop. I do not think the run will be interrupted.

The zero (0) counts of compounds reported from the database have been rectified; they were due to UTF-8 encoding. There is also now a column for InChI. sheet

I will add the WIKIDATA_ID column in the morning. The script is running on the desktop.

Sir, how do I extract Wikidata information from an RDF lookup? As I am opening up to the semantic world, I think I should go through the SPARQL query language. I am finding subject, object and predicate in the RDF lookup for PubChem compounds.

petermr commented 4 years ago

On Sat, Aug 24, 2019 at 8:00 PM Ambarish Kumar notifications@github.com wrote:

Sir, script is running over lab Desktop. I do not think there would be interruption in run.

OK.

Rectified zero(0) count of compounds reported from the database. It was due to UTF-8 encoding. Also, there is a column for InChI. sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompFreqCount.tsv

Good

I will add the WIKIDATA_ID column in morning. Script is running over Desktop.

Good

Sir, how to extract wikidata informations from RDF lookup? I think I should go through SPARQL query language. I am finding subject and predicate into RDF lookup for PubChem compounds.

PubChem ID is Property (P662). So yes, just use RDF/SPARQL as in the tutorial.

This is looking good.
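As a rough illustration of the P662 lookup (not taken from the tutorial mentioned above), a SPARQL query for the Wikidata item carrying a given PubChem CID could be built and sent like this. The function names are hypothetical; the endpoint is the public Wikidata Query Service, which may rate-limit heavy use, so requests should be spaced out and given an identifying User-Agent in real runs.

```shell
#!/bin/sh
# sparql_for_cid CID
# Print a SPARQL query selecting Wikidata items whose PubChem CID (P662) is CID.
sparql_for_cid() {
  printf 'SELECT ?item WHERE { ?item wdt:P662 "%s" . }\n' "$1"
}

# wikidata_for_cid CID
# Send the query to the public endpoint and ask for CSV results.
wikidata_for_cid() {
  curl -sS -G 'https://query.wikidata.org/sparql' \
       -H 'Accept: text/csv' \
       --data-urlencode "query=$(sparql_for_cid "$1")"
}

# Example call: wikidata_for_cid 179
```

Looking up by CID avoids the name-escaping and spelling problems seen earlier in the thread, which is the sense in which it is less fragile than names.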

ambarishK commented 4 years ago

Setting up a new run to get the WIKIDATA IDs. Split the compound names into 20 small chunks of 100 compounds each.

Error for batch run of 2114 compounds.

Generic values (AMIDictionaryTool)
================================
basename            null
cproject            
ctree               
cTreeList           null
dryrun              false
excludeBase         null
excludeTrees        null
file types          []
forceMake           false
includeBase         null
includeTrees        null
log4j               
logfile             null
verbose             0

Specific values (AMIDictionaryTool)
================================
dataCols      null
dictionary    [compounds]
dictionaryTop     compounds
hrefCols      null
input         null
informat      null
dictInformat  null
linkCol       null
log4j         null
nameCol       null
operation     create
outformats    [xml]
splitCol      ,
termCol       null
terms         null
wikiLinks     [wikipedia, wikidata]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!.!!!!!!!!!!!!!!.!!!.!!!!!!!!!!!.!!!!.....!!!!!.!!.!.!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!...!!!..!!!!!!!!!!!!.!!!!!!!!!.!!!!!!!..!!!!!.!!.!!!!!.!.!!...!!!!!!!.!!!!.!!!..!.!.!!!!..!!!!!!.!!!.!!!!!!!.!!!!!!!!!!!!!.!!!!.!!!!.!!.!!.!!!!!!!!......!!!!!!!.!.!!.!!!!!!!!!!!!!!!!..!!!!!!!!!..!!!!!!!!!!!!!.!!!.!!!!!!!!!!.!!!!!!!!..!.!!!!!!!!!..!.!!!!!!!![Fatal Error] :89:3725: The element type "a" must be terminated by the matching end-tag "</a>".
Exception in thread "main" java.lang.RuntimeException: nu.xom.ParsingException: The element type "a" must be terminated by the matching end-tag "</a>". at line 89, column 3725
    at org.contentmine.eucl.xml.XMLUtil.parseXML(XMLUtil.java:395)
    at org.contentmine.ami.lookups.WikipediaLookup.getHtmlBodyFromUrl(WikipediaLookup.java:391)
    at org.contentmine.ami.lookups.WikipediaLookup.getWikidataHtmlBody(WikipediaLookup.java:379)
    at org.contentmine.ami.lookups.WikipediaLookup.queryWikidata(WikipediaLookup.java:416)
    at org.contentmine.ami.tools.AMIDictionaryTool.addWikiLinks(AMIDictionaryTool.java:755)
    at org.contentmine.ami.tools.AMIDictionaryTool.createDictionaryListInRandomOrder(AMIDictionaryTool.java:736)
    at org.contentmine.ami.tools.AMIDictionaryTool.addEntriesToDictionaryElement(AMIDictionaryTool.java:717)
    at org.contentmine.ami.tools.AMIDictionaryTool.writeNamesAndLinks(AMIDictionaryTool.java:685)
    at org.contentmine.ami.tools.AMIDictionaryTool.createDictionary(AMIDictionaryTool.java:524)
    at org.contentmine.ami.tools.AMIDictionaryTool.runDictionary(AMIDictionaryTool.java:409)
    at org.contentmine.ami.tools.AMIDictionaryTool.runSpecifics(AMIDictionaryTool.java:398)
    at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:218)
    at org.contentmine.ami.tools.AMIDictionaryTool.main(AMIDictionaryTool.java:362)
Caused by: nu.xom.ParsingException: The element type "a" must be terminated by the matching end-tag "</a>". at line 89, column 3725
    at nu.xom.Builder.build(Unknown Source)
    at nu.xom.Builder.build(Unknown Source)
    at org.contentmine.eucl.xml.XMLUtil.parseXML(XMLUtil.java:392)
    ... 12 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 89; columnNumber: 3725; The element type "a" must be terminated by the matching end-tag "</a>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    ... 15 more
runWDID.sh: line 2: $'\r': command not found
runWDID.sh: line 3: $'\r': command not found
runWDID.sh: line 4: $'\r': command not found

petermr commented 4 years ago

Thanks,

Difficult to work out what the problem is without more details.

what are you trying to do (1 sentence)
what is your input?
has any of it worked? Can you create a simple test I can run?

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 4 years ago

Sir, I am retrieving WIKIDATA IDs for compound names using ami-dictionary.

For the batch run, a script runs ami-dictionary with 2114 compound names as search-term inputs.

I split the script into 21 small chunks, each containing 100 compound names.

Every chunk generates output except one. The example script covers batch runs 1 to 21 (each batch is for 100 compound names). The 6th batch run produces an error; the rest run well.

petermr commented 4 years ago

On Sun, Aug 25, 2019 at 12:40 PM Ambarish Kumar notifications@github.com wrote:

Sir, I am getting WIKIDATA ID for compound names using ami-dictionary.

As a batch run, a script runs ami-dictionary for 2114 compound names as search terms inputs.

I splitted the script into 21 small chunks. Each containing 100 compound names.

Good design

Each splitted run is generating output, except one. Example script https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/runWDIDbatch1-21.sh is for first 1 to 21 batch runs ( each batch is for 100 compound names). 6th batch run is producing error. And rest are running well.

So the classic way is a (binary) chop. Split the 6th group into (say) 10 groups (6.0 ... 6.9) and run them all. One will fail (say 6.3); then split 6.3 into 6.3.0 ... 6.3.9 and we'll know where the problem is.
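The chop described above can be sketched as follows. This is only an illustration, assuming exactly one name in the failing batch triggers the error; `run_batch` is a hypothetical stand-in for whatever runs ami-dictionary on a list of names and reports success, and is not part of ami. It halves the batch rather than splitting it into ten, which needs fewer runs.

```python
from typing import Callable, List, Optional


def find_failing_name(names: List[str],
                      run_batch: Callable[[List[str]], bool]) -> Optional[str]:
    """Return the one name that makes run_batch fail, or None if the batch passes.

    run_batch is a hypothetical callback: it runs the lookup on the given
    names and returns True on success, False on error.
    Assumes at most one bad name in the batch.
    """
    if run_batch(names):
        return None
    while len(names) > 1:
        mid = len(names) // 2
        first, second = names[:mid], names[mid:]
        # Keep whichever half still reproduces the failure.
        names = first if not run_batch(first) else second
    return names[0]
```

For a batch of 100 names this needs about 8 runs, compared with 20 for two rounds of a ten-way split.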


ambarishK commented 4 years ago

Sir,

Run error is due to 4-methylacetophenone.

Compiled all batch run results to get WIKIDATA ID.

Renamed resolveCompFreqCount.tsv to resolveCompFreqCountInChIWD.tsv.

Column descriptions for the sheet resolveCompFreqCountInChIWD.tsv are as follows.

Compound_identifier - unique compound identifier.
original_name - original compound name as present in EssoilDB 1.0.
clean_name - cleaned compound name.
cid - compound CID (retrieved from PubChem).
Freq - frequency count of each compound in the essential oil profiles of plants reported in EssoilDB 1.0.
InChI - InChI of the compound (retrieved from PubChem).
wikidata - WIKIDATA identifier.
name - WIKIDATA lookup name for each compound.
description - description of each compound from the WIKIDATA lookup.
wikipedia - WIKIPEDIA query string.

The name and description columns are kept only to check for disambiguation. Once the WIKIDATA IDs are disambiguated, these columns will be dropped.

The disambiguation problems found in the WIKIDATA ID search are listed below.

C6956       ni
C6282       sa2                    cell line
C6283       sa3                    Wikimedia disambiguation page
C6395       sh1                    Wikipedia disambiguation page
C6396       sh2                    InterPro Domain
C6397       sh3                    InterPro Domain
C6398       sh4                    Wikipedia disambiguation page

Result: 1389 WIKIDATA IDs and 395 query strings for WIKIPEDIA links.

petermr commented 4 years ago

On Mon, Aug 26, 2019 at 9:18 AM Ambarish Kumar notifications@github.com wrote:

Sir,

Run error is due to 4-methylacetophenone.

Well done for identifying this. I'll have a browse to see what the problem is. Maybe it has some blank fields.

Compiled all batch run results to get WIKIDATA ID.

Renaming resolveCompFreqCountInChI.tsv to resolveCompFreqCountInChIWD.tsv.

Column description for the sheet - resolveCompFreqCountInChIWD.tsv https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompFreqCountInChIWD.tsv is as follows.

Compound_identifier - unique compound identifier.

original_name - original compound name present into the EssoilDB1.0.

clean_name - cleaned name of compounds.

cid - compound cid (retrieved from PubChem).

Freq - frequency count of each compound into essential oil profile of plants reported into the Essoildb1.0.

InChI - InChIs of compound (retrieved from PubChem)

wikidata - WIKIDATA identifier.

name - WIKIDATA lookup name for each compound.

description - description of each compound for wikidata lookup.

wikipedia - WIKIPEDIA query string.

name and description column is left only for check for disambiguation. As WIKIDATA ID are disambiguated, these will be dropped down.

Listed disambiguations into WIKIDATA ID search are as follows.

C6956 ni

I don't know what this is. I am sure it's not the element Ni. It's probably something like "No Information". OMIT

C6282 sa2 cell line
C6283 sa3 Wikimedia disambiguation page
C6395 sh1 Wikipedia disambiguation page
C6396 sh2 InterPro Domain
C6397 sh3 InterPro Domain
C6398 sh4 Wikipedia disambiguation page

These are NOT compounds, so OMIT

1389 WIKIDATA ID. 395 query string for WIKIPEDIA link.

Thanks. Can you also generate the short InChI? I'll explain. ARE YOU FREE FOR A SKYPE?


ambarishK commented 4 years ago

Yes sir. I am on Hangouts right now.

petermr commented 4 years ago

2 minutes...


petermr commented 4 years ago

The problems you are having are due to the lookup process in Wikidata. When you look for C758 by its NAME, (-)-alloaromadendrene, you find all objects with alloaromadendrene in them. That includes articles! For example: "Essential Oil Alloaromadendrene from Mixed-Type Cinnamomum osmophloeum Leaves Prolongs the Lifespan in Caenorhabditis elegans".

So we have to search for the compound itself using the PubChem ID. I think the best thing is to delete the H and I (name and description) columns and rerun with SPARQL for the PubChem ID.
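A lookup keyed on the PubChem CID rather than the name might look like the sketch below. It only builds the query text and does not contact Wikidata. `P662` is the Wikidata property for PubChem CID; the function name is made up for illustration, and the two CIDs are simply values that appear earlier in this thread.

```python
from typing import Iterable


def build_cid_query(cids: Iterable[str]) -> str:
    """Build a SPARQL query that maps the given PubChem CIDs to Wikidata items."""
    checked = []
    for cid in cids:
        # CIDs are plain integers; reject anything else before it reaches the query.
        if not str(cid).isdigit():
            raise ValueError(f"not a PubChem CID: {cid!r}")
        # Wikidata stores external identifiers as string literals, hence the quotes.
        checked.append(f'"{cid}"')
    values = " ".join(checked)
    return (
        "SELECT ?compound ?cid WHERE {\n"
        f"  VALUES ?cid {{ {values} }}\n"
        "  ?compound wdt:P662 ?cid .\n"
        "}"
    )


print(build_cid_query(["8892", "8186"]))
```

Because the match is on an identifier, articles and disambiguation pages that merely contain the compound name should not be returned.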

P.


ambarishK commented 4 years ago

OK sir. I am going through the SPARQL queries.

petermr commented 4 years ago

PubChem is slightly messy - I have been tweeting. If there is a Wikidata entry (e.g. Camphor) it links to the EN language Wikipedia. If there's no WP but there is a Wikidata it links to that but it's still labelled Wikipedia (wrongly).

So the safest approach is:

name ==pubchem=> Pubchem CID
PubchemID ==pubchem=> Wikipedia link (page)
CID ==wikidataSparql=> wikidata ID
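The first arrow above (name to CID) could be driven through PubChem's PUG REST interface. The sketch below only constructs the request URL, assuming the standard `compound/name/.../cids/JSON` path; the function name is hypothetical, and fetching and parsing the response is left out. The name is sent whole, with stereo prefixes such as "(E)-2-" kept in place.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cid_url(name: str) -> str:
    """Return the PUG REST URL that looks up CIDs for a compound name."""
    # safe="" percent-encodes every reserved character, including "/" and ",",
    # so names such as "(E)-2,(Z)-6-decadienal" stay in a single path segment.
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


print(name_to_cid_url("(E)-2-undecenol"))
```

A name that PubChem cannot resolve should be recorded as unresolved rather than shortened until something matches.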

P.


ambarishK commented 4 years ago

Sir, please check the updated table - resolveCompTable20190827.tsv - containing cleaned compound names, compound cid, InChIs, InChIKey and WIKIDATA id.

A short briefing on generating the WIKIDATA id from the compound cid using a SPARQL query is as follows.

- Step 1

Run the SPARQL query.

SELECT DISTINCT ?compound ?compoundLabel ?cid WHERE {
  ?compound wdt:P662 ?cid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

- Step 2

Download all retrieved records: `?compound` (the WIKIDATA entry) and `?cid` (the PubChem compound CID).

- Step 3

Map the WIKIDATA id to the EssoilDB compound table using the compound CIDs.

- RScript to map the SPARQL query results to the EssoilDB compound table.

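# Read the EssoilDB compound table and the downloaded SPARQL results.
compTable <- read.csv("E:/resolveCompTable.csv")

sparqlResults <- read.csv("E:/sparqlResults.csv")

# Left join on cid: all.x = TRUE keeps compounds that have no WIKIDATA match.
resolveCompTable20190827 <- merge(compTable, sparqlResults, by = "cid", all.x = TRUE)

write.csv(resolveCompTable20190827, "E:/resolveCompTable20190827.csv")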
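A variant of the mapping step that records ambiguity explicitly can be sketched in Python. This is an illustration, not the script used for the table: the function names are hypothetical, rows are assumed to be dictionaries with a `cid` key, and `?compound` is assumed to come back as an entity URI of the form `http://www.wikidata.org/entity/Q...`.

```python
from typing import Dict, Iterable, List


def qid_from_uri(uri: str) -> str:
    """Take the last path segment of a Wikidata entity URI as the QID."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def attach_wikidata(compounds: Iterable[Dict[str, str]],
                    sparql_rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Left join on cid: every compound row is kept, matched or not."""
    by_cid: Dict[str, List[str]] = {}
    for row in sparql_rows:
        by_cid.setdefault(row["cid"], []).append(qid_from_uri(row["compound"]))
    joined = []
    for comp in compounds:
        qids = by_cid.get(comp["cid"], [])
        # More than one QID for a CID is an ambiguity to record, not to hide.
        joined.append({**comp, "wikidata": qids, "ambiguous": len(qids) > 1})
    return joined
```

Unlike `merge`, which repeats a compound row once per match, this keeps one row per compound and lists every matching QID, so compounds with zero or several Wikidata items are easy to pick out for manual checking.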

The PubChem CID is WIKIDATA property `P662`. The SPARQL query retrieves all compounds that have a CID. In the query, `?compound` is the subject, `wdt:P662` is the predicate and `?cid` is the object.

Additionally, one may retrieve ChEBI identifiers, KEGG identifiers, ChemSpider identifiers, compound formula, InChIKey, CAS number and so on (where available in WIKIDATA) with a SPARQL query such as:

SELECT DISTINCT ?compound ?compoundName ?cas ?formula ?compoundLabel ?inchikey ?chemspider ?pubchem ?chebi ?KEGG_id WHERE {
  ?compound wdt:P662 ?cid .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P235 ?inchikey . }
  OPTIONAL { ?compound wdt:P661 ?chemspider . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  OPTIONAL { ?compound wdt:P683 ?chebi . }
  OPTIONAL { ?compound wdt:P665 ?KEGG_id . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}



InChIKey is added via the PubChem Identifier Exchange Service using the PubChem compound CID.
Count of retrieved WIKIDATA identifiers for EssoilDB compounds is `1317`.

[SPARQL query editor](https://query.wikidata.org/)

petermr commented 4 years ago

On Tue, Aug 27, 2019 at 7:55 AM Ambarish Kumar notifications@github.com wrote:

Sir, Please check for the updated table - resolveCompTable20190827.tsv https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompTable20190827.tsv containing cleaned compound names, compound cid, InChIs, InChIKey and WIKIDATA id.

Thanks, looks good. Manny and I will need to go through the chemistry and remove false positives (e.g. SA3).

We need a LINK in the TSV to a chemical structure diagram, either PubChem or Wikidata, so we can eyeball the structures and remove the FPs.

A short briefing to generate the WIKIDATA id based on compound cid is as follows.

I can't see this. Probably add this as a comment in the Issue


ambarishK commented 4 years ago

Sir, please go through my edited previous comment on the issue. All the steps are described in it.

petermr commented 4 years ago

Thanks - my bad, I was reading the email, not the Issues.


ambarishK commented 4 years ago

Sir, chemical structure diagrams can be fetched using the PubChem download services or, alternatively, using a SPARQL query.

petermr commented 4 years ago

Maybe we should download 2.1K images into the compound directory. Pubchem is more standardised than Wikidata, but some of the diagrams are awful. Then we can link directly without having to go through the WWW.
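One way to plan such a download is sketched below, assuming PubChem's PUG REST `PNG` output for a CID. The file layout (`compound/<cid>.png`) and the function names are hypothetical; the sketch only computes the URL and the local path and leaves the actual fetching, and any rate limiting, to the caller.

```python
from pathlib import PurePosixPath

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def image_url(cid: int) -> str:
    """URL of the 2D structure depiction PubChem serves for this CID."""
    return f"{PUG_REST}/compound/cid/{cid}/PNG"


def local_image_path(cid: int, directory: str = "compound") -> str:
    """Repository-relative path where the downloaded image would be stored."""
    return str(PurePosixPath(directory) / f"{cid}.png")


for cid in (8892, 8186):
    print(image_url(cid), "->", local_image_path(cid))
```

Naming each file by CID means the link column in the TSV can be derived from the existing cid column without a second lookup.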

petermr commented 4 years ago

Comment: Wikidata does not have entries for over 100 compounds. I'll find out how to add them from PubChem.

ambarishK commented 4 years ago

OK sir.

gilienv / EssOilDB

Disambiguating chemistry and fixing typos #76
