Open petermr opened 5 years ago
On Tue, Aug 20, 2019 at 9:23 AM Ambarish Kumar notifications@github.com wrote:
Sir, Removed all comp+comp and comp/comp mixtures.
Good
The reason I find for the NA entries that PubChem generates for the majority of compound names (used as query input) is that these names are not among the synonyms that depositors have supplied to PubChem.
Resolving all remaining names using PubChem REST API.
Please go through the first 208 findings as first batch job - sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/nameCleaningPubChemLookup.csv .
HAVE YOU RESOLVED ALL THE OTHERS? are there only 208 you cannot resolve?
70 entries are available in PubChem and the rest are not searchable.
Examples of not-retrieved entries are as follows. PubChem_lookup is generated after truncating the names. Is this the right way to get PubChem_lookup and retrieve compound_cid?
ID | not-retrieved entries | PubChem_lookup | Compound_CID
C898 | (E)-3-hexanoic acid | HEXANOIC ACID | 8892
This name is corrupt
C5 | (E)-2,2-decenal | not found | NA
corrupt
C4 | (E)-2,(Z)-6-decadienal | 2,6-Decadienal | 5283350
corrupt
C893 | (E)-2-undecenol | Undecenol | 22506525
Why are you removing the "(E)-2-" string?
C891 | (E)-2-undecanal | UNDECANAL | 8186
corrupt
C799 | (2)-3-hexenylacetate | Cis-3-Hexenyl Acetate | 5363388
corrupt
C800 | (2)-3-hexenylbenzoate | Cis-3-HEXENYLBENZOATE | 32809
corrupt
Please summarise clearly exactly how many compounds you started with, and why you discarded them. A flow diagram is very useful here.
Here's an example for papers, You can do the same for compounds https://www.researchgate.net/figure/Bibliography-search-PRISMA-This-figure-represents-the-methodology-applied-to-screen_fig1_327298098
Other aspects: DO NOT translate alpha to .alpha. I have already mentioned this.
Your "clean" column should only be capitalization, dashes, whitespace and replacement of corrupt unicode characters. NEVER try to translate chemistry. That must be left to chemical experts.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gilienv/EssOilDB/issues/76?email_source=notifications&email_token=AAFTCS25FC4UQWKV4QGVD6LQFOSXLA5CNFSM4ICLYMF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4VPPUA#issuecomment-522909648, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFTCS55QAPVZFYXQOMQQKTQFOSXLANCNFSM4ICLYMFQ .
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
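The "clean" rule above (syntax only, no chemistry) can be sketched as follows. This is a minimal illustration, not the project's cleaning script: the function name `clean_name` and the set of dash characters are assumptions, and capitalization and repair of corrupt characters are left out because the thread does not state rules for them.

```python
import re

# Dash-like characters that extracted text often uses in place of the ASCII hyphen.
# The exact set is an assumption made for this sketch.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"
DASH_TABLE = {ord(ch): "-" for ch in DASHES}


def clean_name(raw: str) -> str:
    """Normalise dashes and whitespace only; the chemistry is left untouched."""
    text = raw.translate(DASH_TABLE)   # en dash, minus sign, etc. become "-"
    text = re.sub(r"\s+", " ", text)   # collapse whitespace runs (including NBSP)
    return text.strip()
```

Locants, stereo descriptors such as (E) or (Z), and Greek-letter prefixes pass through unchanged, so a name is never "translated" by this step.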
Sir, Reverted all changes made to the -alpha-, -beta-, -gamma- and -delta- notations. All other changes are as stated.
C891 | (E)-2-undecanal | UNDECANAL | 8186
The cited example is a search entry for which isomeric or other nomenclature notations (stereoisomeric (E) or (Z), or often the functional-group position) had to be truncated to get a PubChem search result.
For the first batch job, I started with 220 compounds with clean_name. 80 of them generate search results with their available or reported synonyms; out of them I am selecting the best search. Flow-diagram. All records are in the table. Sir, none are discarded.
This cannot be right.
Is the 220 a sample?? where does it come from?? The top box should be ca. 7500
The bottom line is meaningless. The boxes have the same labels. And they don't sum to the box above.
You must be more precise with your terminology. At present it's meaningless.
On Tue, Aug 20, 2019 at 11:04 AM Ambarish Kumar notifications@github.com wrote:
[image: dignamecleaning] https://user-images.githubusercontent.com/36997739/63338196-fb7b9300-c35f-11e9-8b90-44026c4f81b9.jpg
Sir, This is the flow diagram with corrected levels.
220 is the compound list for first batch job to get search results from PubChem REST API. This is the list of first 220 compounds from the list of compounds which did not generate CIDs from PubChem Identifier Exchange Services.
On Tue, Aug 20, 2019 at 11:58 AM Ambarish Kumar notifications@github.com wrote:
[image: newflowdig] https://user-images.githubusercontent.com/36997739/63341730-801ddf80-c367-11e9-8c85-0452d1e86d46.jpg
Label the files (compound20190816.tsv - 7171 entries (or is it 7170?)). After excluding 252 from 7171 you should have a labelled box with 7171 - 252 = 6919 compounds. And so on. Detail each step:
Hopefully then you have a list of compounds that passed (compound20190822_resolvedPC.tsv) and some that failed. Create a table of those that failed. If it's larger than 500 then there is a problem. Put it in a separate file with a dated label (compound20190822_unresolvedPC.tsv).
Thank you. I still don't understand why there are 4200 compounds that can't be resolved. I've spotted a few obvious problems:
There are also a LOT of names that are corrupt or underspecified.
In fact it looks like more than half the names in EssoilDB are corrupt or underspecified
This does not mean that half the compounds in the profiles are underspecified/corrupt, as there will be many compounds which occur many times. Hmm... I think we have to discard the 4200 names (we can keep the profiles that they occur in, but we cannot give a chemical formula to some of the components). If they only occur once then they are a small percentage (4200/142,000 = 3%). There is no doubt that machine extraction of the literature will be more reliable. This was harder than I thought. For some reason there are fewer corrupt plant names.
This is then our draft version of the chemical table - 2971 names. Now we should decide how many are unique, e.g.
C170 3-hydroxy-2-butanone 3-hydroxy-2-butanone 179
C2171 3-hydroxybutan-2-one 3-hydroxybutan-2-one 179
C2779 acetoin acetoin 179
are all the same compound.
P.
Sir,
Restructured the flow-diagram and added a combined sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/originalNamePlusCleanName.csv as well as separate sheets for found https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/foundOriginalNamePlusCleanName20190821.csv and not found https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/notFoundOriginalNamePlusCleanName20190821.csv search entries using PubChem identifier exchange services.
Where is the flow diagram?
The diagram needs more refinement. Diverging lines represent separations. Also try to make names shorter than 15 characters. I'd suggest:
raw20190816.tsv (7170)
        |
   clean syntax
        |
clean20190816.tsv
        |
  PubChem lookup
      /      \
resolved20190816.tsv   unresolved20190816.tsv
      (2974)                 (4196)
I don't think there is any point in removing mixtures as Pubchem won't resolve them and there is so much corruption it's not worth trying to do more.
then maybe
resolved20190820.tsv (2974)
        |
  remove synonyms
        |
uniqueCompounds20190821.tsv
where there is only one CID (or InChI).
We also need to extract the InChIs.
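One way to keep a flow diagram honest is to check that the child boxes of each split add up to the parent box. The sketch below does that for the PubChem lookup step, using the counts quoted in the diagram; the helper name `check_split` is hypothetical, and it assumes the cleaning step drops no rows, so the cleaned file still holds 7170 entries.

```python
def check_split(parent: int, children: dict) -> int:
    """Return parent minus the sum of the child boxes (0 means the step balances)."""
    return parent - sum(children.values())


# Counts as quoted in the diagram; the parent count for the cleaned file is an
# assumption (cleaning is taken to keep all 7170 rows).
lookup_step = {
    "resolved20190816.tsv": 2974,
    "unresolved20190816.tsv": 4196,
}

unaccounted = check_split(7170, lookup_step)
```

A non-zero result points at entries that were lost or double-counted between boxes.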
On Wed, Aug 21, 2019 at 5:25 PM Ambarish Kumar notifications@github.com wrote:
Yes sir, all three are same compounds - CID - 179.
flowDig https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/flowDIG.jpg .
I have removed duplicate CIDs and this gives 2112 rows, so uniqueCompounds could be the final table and resolved will contain synonyms. The actual name in unique will be arbitrary at this stage.
@mannyrules this shows that we need to check compounds for uniqueness when they are ingested. P.
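Removing duplicate CIDs can be sketched as below. The three rows are the acetoin synonyms quoted earlier in the thread; the function names and the tuple layout (compound id, clean name, CID) are assumptions for illustration and do not describe the actual script.

```python
from collections import defaultdict

# (compound id, clean name, CID); rows quoted from the thread.
resolved = [
    ("C170", "3-hydroxy-2-butanone", 179),
    ("C2171", "3-hydroxybutan-2-one", 179),
    ("C2779", "acetoin", 179),
]


def group_by_cid(rows):
    """Map each CID to every name that resolved to it (the synonym table)."""
    synonyms = defaultdict(list)
    for _compound_id, name, cid in rows:
        synonyms[cid].append(name)
    return dict(synonyms)


def unique_compounds(rows):
    """Keep one row per CID; the first row seen wins, so the kept name is arbitrary."""
    seen = {}
    for row in rows:
        seen.setdefault(row[2], row)
    return list(seen.values())
```

The same synonym mapping could later serve to check a new name at ingest time, as suggested above.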
I think the final chemical/compound step is to link the good unique compounds (not names) (ca 2112) back to the profile records. An immediate task is to find the distribution of compounds - which are the commonest? That will be really useful for the next steps as we will have "most" of the compounds in essential oils.
In principle this table and the plant table can then be used to validate new input. This would mean that any name which wasn't in the synonym table could be checked when it was ingested - always the best time.
Sir,
raw20190822.tsv clean20190822.tsv resolved20190822.tsv unresolved20190822.tsv uniqueCompounds20190822.tsv
newFlowDiagram - with uniform date-stamp to each file.
I have done an experiment of looking up name via pubchem API.
The compounds are in table 1 of PMC5248495 (I haven't committed those papers but I will)
Here's a typical script:
#! /bin/sh
echo 1-octen-3-ol
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/1-octen-3-ol/cids/XML
echo 1-8-Cineole
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/1-8-Cineole/cids/XML
echo %28Z%29-beta-Ocimene
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%28Z%29-beta-Ocimene/cids/XML
echo gamma-Terpinene
curl https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/gamma-Terpinene/cids/XML
Note that the names must be escaped as:
The results may have zero, 1 or many CIDs retrieved. Here is a typical result (I have massaged the o/p to make it readable). I add comments <!-- ... -->
<compounds>
<Compound name="1-octen-3-ol">
<CID id="18827"/> <!-- single CID no problem -->
</Compound>
<Fault name="1-8-Cineole"/> <!-- not found because it should be `1,8` -->
<Compound name="%28Z%29-beta-Ocimene"> <!-- note escaping -->
<CID id="5320250"/>
</Compound>
<Compound name="gamma-Terpinene">
<CID id="7461"/>
</Compound>
<Fault name="Fenhone"/> <!-- misspelt in article -->
<Compound name="Linalool">
<CID id="6549"/>
</Compound>
<Compound name="Camphor">
<CID id="2537"/>
</Compound>
<Compound name="alpha-Terpineol">
<CID id="17100"/>
</Compound>
<Compound name="Methyl%20chavicol"> <!-- space escaped -->
<CID id="8815"/>
</Compound>
<Compound name="Nerol">
<CID id="643820"/>
</Compound>
<Compound name="Neral">
<CID id="643779"/>
</Compound>
<Compound name="Geraniol">
<CID id="637566"/>
</Compound>
<Compound name="Geranial">
<CID id="638011"/>
</Compound>
<Compound name="Bornyl%20acetate"> <!-- many synonyms, I think because of substituents -->
<CID id="6448"/>
<CID id="93009"/>
<CID id="442460"/>
<CID id="6950274"/>
<CID id="443131"/>
<CID id="3034424"/>
<CID id="12097317"/>
<CID id="44630108"/>
<CID id="57505377"/>
</Compound>
<Compound name="Neryl%20acetate">
<CID id="1549025"/>
</Compound>
<Compound name="Methyl%20cinnamate">
<CID id="637520"/>
</Compound>
<Compound name="beta-Elemene">
<CID id="6918391"/>
</Compound>
<Compound name="beta-Caryophyllene">
<CID id="5281515"/>
</Compound>
<Fault name="beta–Copaene"/> <!-- I don't know why this isn't found -->
<Compound name="trans-alpha-Bergamotene">
<CID id="6429302"/>
</Compound>
<Compound name="alpha-Humulene">
<CID id="5281520"/>
</Compound>
<Compound name="cis-beta-Farnesene">
<CID id="5317319"/>
</Compound>
<Compound name="Germacrene%20d">
<CID id="5317570"/>
<CID id="5373727"/>
<CID id="6436582"/>
<CID id="91104"/>
<CID id="49796490"/>
<CID id="91723653"/>
</Compound>
<Compound name="beta-Cubebene">
<CID id="93081"/>
</Compound>
<Compound name="alpha-Bulnesene">
<CID id="94275"/>
</Compound>
<Fault name="alpha-Amorphen"/> <!-- misspelt -->
<Compound name="delta-Cadinene">
<CID id="441005"/>
</Compound>
<Compound name="Aromadendrene">
<CID id="91354"/>
<CID id="11095734"/>
<CID id="12305243"/>
<CID id="91746456"/>
</Compound>
<Compound name="Spathulenol">
<CID id="92231"/>
<CID id="522266"/>
<CID id="6432640"/>
<CID id="97032059"/>
<CID id="13854255"/>
</Compound>
<Compound name="Caryophyllene%20oxide">
<CID id="1742210"/>
</Compound>
<Fault name="alpha–Bisabolene"/> <!-- don't know why -->
<Fault name="beta-Bisabolenene"/> <!-- typo -->
<Compound name="alpha-Bisabolol">
<CID id="442343"/>
<CID id="1549992"/>
<CID id="1201551"/>
<CID id="10586"/>
<CID id="6506009"/>
</Compound>
</compounds>
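The zero/one/many distinction in the result above can be tabulated with a short parser. This sketch reads the massaged `<compounds>` layout shown in this comment, not PubChem's raw response; the function name `classify` and the shortened sample are assumptions.

```python
import xml.etree.ElementTree as ET

# A shortened sample in the same massaged layout as the result above.
SAMPLE = """<compounds>
  <Compound name="1-octen-3-ol"><CID id="18827"/></Compound>
  <Fault name="1-8-Cineole"/>
  <Compound name="Aromadendrene">
    <CID id="91354"/><CID id="11095734"/><CID id="12305243"/><CID id="91746456"/>
  </Compound>
</compounds>"""


def classify(xml_text: str) -> dict:
    """Sort names into 'none', 'single' and 'multiple' by number of CIDs."""
    root = ET.fromstring(xml_text)
    result = {"none": [], "single": {}, "multiple": {}}
    for fault in root.findall("Fault"):
        result["none"].append(fault.get("name"))
    for compound in root.findall("Compound"):
        name = compound.get("name")
        cids = [int(c.get("id")) for c in compound.findall("CID")]
        if not cids:
            result["none"].append(name)
        elif len(cids) == 1:
            result["single"][name] = cids[0]
        else:
            result["multiple"][name] = cids
    return result
```

Names in the "multiple" group are the ones that need a human decision; the "none" group feeds the unresolved table.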
Let me go through it sir.
Your flow diagram is starting to look good. I would colour the operations (e.g. clean, lookup) with a different colour and also embed them in the line. So move "clean" and "Remove" so the line passes through them.
How are you doing the lookup? In the same way as me?
P.
Sir, I am using the web API for name lookup and to check for its availability in PubChem. For CID retrieval I use PubChem identifier exchange services.
Please give the URLs and examples of the RESTful APIs.
https://pubchem.ncbi.nlm.nih.gov/ - Used for name lookup and checking for the availability of compound name into the repository.
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi - PubChem identifier exchange services.
PubChem identifier exchange services and PUG REST API perform equally well.
Example for the PubChem identifiers exchange services - PubChem services
In case of batch retrieval, browse for the .csv file containing the list of compound names.
I found it easier than PUG REST API as it does not ask for replacing white-space or parentheses with appropriate notations like %20, %28 or %29.
Both services perform equally well: I passed the unresolved compound names to both of them (after substituting the notations for white-space and parentheses for PUG REST API) and the generated results are similar.
For example -
C4995 iso-borneol
PubChem identifier exchange services: Result set is empty
PUG REST API: url - https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/iso-borneol/cids/xml - <Message>No CID found</Message>
C5044 isobonyl acetate
PubChem identifier exchange services: Result set is empty
PUG REST API: url - https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/isobonyl%20acetate/cids/xml - <Message>No CID found</Message>
C828 (4Z)-decenal
PubChem identifier exchange services: Result set is empty
PUG REST API: url - https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%284Z%29-decenal/cids/xml - <Message>No CID found</Message>
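The manual replacement of spaces and parentheses can be avoided by letting a standard library do the percent-encoding. The sketch below only builds the PUG REST URL in the pattern used in the examples above; the function name `cid_lookup_url` is hypothetical, and no request is sent.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"


def cid_lookup_url(name: str, fmt: str = "xml") -> str:
    """Build the name-to-CID URL with the compound name percent-encoded."""
    # safe="" also encodes "/" so a name can never break the URL path.
    return f"{BASE}/{quote(name, safe='')}/cids/{fmt}"
```

For instance, `cid_lookup_url("(4Z)-decenal")` gives the same URL as the third example above, with `%28` and `%29` in place of the parentheses.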
Sir, after syntax correction in the next pass, only 35 unique compound names resolved (generated a CID).
(Z)-9-octadecen-1-ol 5284499
cadinol 6428423
caryophyllene 5281515
cymene 7463
gurjunene 15560275
muurolene 12306047
phellandrene 7460
thujene 520384
1-epi-cubenol 519857
alpha-bergamotene 86608
alpha-calacorene 12302243
alpha-copaene 70678558
alpha-selinene 10856614
beta-bisabolene 10104370
beta-bourbonene 62566
beta-selinene 442393
carotol 442347
caryophyllene oxide 1742210
cis-alpha-bisabolene 5352653
cryptomerione 11964091
cubenol 11770062
dehydroaromadendrene 91746711
delta-cadinene 441005
isospathulenol 14038848
mustakone 12313013
spathulenol 92231
trans-caryophyllene 5281515
alpha-acoradiene 90351
beta-Copaene 57339298
E-beta-ocimene 5281553
epi-alpha-muurolol 3084331
Germacrene-B 15559495
Germacrene-D 91723653
selin-11-en-4alpha-ol 15560330
trans-alpha-bergamotene 6429302
Trying curl for the remaining unresolved compound names. Writing a script for that.
Thank you, This is clear. I didn't realise you were using the identifier exchange and that it managed multiple names/lookup. Good. Where did these compounds come from? We need to standardise on test sets...
But I think we need to make this more systematic. At present what I'd like to do is come up with a resolved list of compounds (e.g. the 2112 set) AND their counts in EssoilDB. Can you do this?
Sir, these compounds are from the unresolved20190822.tsv.
Yes sir, I can get the count of each entry of the resolved list of compounds into EssoilDB.
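Counting how often each resolved compound occurs could look like the sketch below. The name-to-CID mapping reuses CIDs quoted in this thread, while the list of profile names is made up for illustration; `count_by_cid` is a hypothetical helper, not the script that produced the sheet.

```python
from collections import Counter


def count_by_cid(profile_names, name_to_cid):
    """Count profile occurrences per CID; resolved CIDs that never occur stay at 0."""
    counts = Counter({cid: 0 for cid in set(name_to_cid.values())})
    for name in profile_names:
        cid = name_to_cid.get(name)
        if cid is not None:        # unresolved names are simply skipped
            counts[cid] += 1
    return counts


# CIDs as quoted in the thread; the profile entries below are invented examples.
name_to_cid = {"3-hydroxy-2-butanone": 179, "acetoin": 179, "gamma-Terpinene": 7461}
profile_names = ["acetoin", "acetoin", "3-hydroxy-2-butanone", "not-a-known-name"]
counts = count_by_cid(profile_names, name_to_cid)
```

Because the match is on exact strings, any difference in encoding or spacing between the two tables shows up as a zero count.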
Sir, go through the frequency count of each resolved compound in EssoilDB - sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompFreqCount.tsv
Column description -
Thanks, This looks very useful
Welcome sir.
The next steps are to get
- InChI
- wikidata
into the same table.
There are 17 compounds that have zero (0) count. I am not surprised - it's a problem of the lookup and could be caused by small lexical problems (e.g. non-unicode or spaces). I am not worried.
Yes sir.
Running a script https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/runWDID.sh for ami-dictionary to get WIKIDATA ID. Also, rectifying '0' count for compounds in the EssoilDB (which occurred due to UTF-8 encoding).
OK, You might also wish to use RDF lookup for the unique set of PubChem identifiers. Less fragile than names.
And it may be a good idea to batch this into chunks of (say) 200 compounds in case you need to restart.
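Batching can be sketched in a few lines. The helper name `chunked` and the placeholder names are assumptions; the size of 200 follows the suggestion above, and 2112 is the row count reported earlier for the unique compounds.

```python
def chunked(items, size=200):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder names standing in for the unique compound list.
names = [f"name{i}" for i in range(2112)]
batches = chunked(names, 200)
```

Writing each batch's output to its own file means a failed run only has to be repeated for that batch.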
Sir, script is running over lab Desktop. I do not think there would be interruption in run.
Rectified zero(0) count of compounds reported from the database. It was due to UTF-8 encoding. Also, there is a column for InChI. sheet
I will add the WIKIDATA_ID column in morning. Script is running over Desktop.
Sir, how to extract wikidata information from RDF lookup? As I am getting open to the semantic world, I think I should go through the SPARQL query language. I am finding subject, object and predicate in the RDF lookup for PubChem compounds.
On Sat, Aug 24, 2019 at 8:00 PM Ambarish Kumar notifications@github.com wrote:
Sir, script is running over lab Desktop. I do not think there would be interruption in run.
OK.
Rectified zero(0) count of compounds reported from the database. It was due to UTF-8 encoding. Also, there is a column for InChI. sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompFreqCount.tsv
Good
I will add the WIKIDATA_ID column in morning. Script is running over Desktop.
Good
Sir, how to extract wikidata information from RDF lookup? I think I should go through SPARQL query language. I am finding subject and predicate into RDF lookup for PubChem compounds.
Pubchem ID is Property (P662) So yes, just use RDF/SPARQL as in the tutorial
This is looking good.
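A lookup by property P662 could be phrased as a SPARQL query like the one built below. This is a sketch under assumptions: it targets the public Wikidata query endpoint, the function names are hypothetical, and it only constructs the query text and URL without sending anything. It is not taken from the tutorial mentioned above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(cids):
    """Build a SPARQL query mapping PubChem CIDs (property P662) to Wikidata items."""
    values = " ".join(f'"{int(cid)}"' for cid in cids)  # P662 values are strings
    return (
        "SELECT ?item ?cid WHERE { "
        f"VALUES ?cid {{ {values} }} "
        "?item wdt:P662 ?cid . }"
    )


def query_url(cids):
    """URL that would return the matches as JSON when fetched."""
    return ENDPOINT + "?" + urlencode({"query": build_query(cids), "format": "json"})
```

Passing CIDs in batches through `VALUES` keeps the number of requests small, and matching on the identifier avoids the name ambiguities seen earlier.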
Setting up a new run for getting WIKIDATA ID. Split the compound names into 20 small chunks of 100 compounds each.
Error for batch run of 2114 compounds.
Generic values (AMIDictionaryTool)
================================
basename null
cproject
ctree
cTreeList null
dryrun false
excludeBase null
excludeTrees null
file types []
forceMake false
includeBase null
includeTrees null
log4j
logfile null
verbose 0
Specific values (AMIDictionaryTool)
================================
dataCols null
dictionary [compounds]
dictionaryTop compounds
hrefCols null
input null
informat null
dictInformat null
linkCol null
log4j null
nameCol null
operation create
outformats [xml]
splitCol ,
termCol null
terms null
wikiLinks [wikipedia, wikidata]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!.!!!!!!!!!!!!!!.!!!.!!!!!!!!!!!.!!!!.....!!!!!.!!.!.!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!...!!!..!!!!!!!!!!!!.!!!!!!!!!.!!!!!!!..!!!!!.!!.!!!!!.!.!!...!!!!!!!.!!!!.!!!..!.!.!!!!..!!!!!!.!!!.!!!!!!!.!!!!!!!!!!!!!.!!!!.!!!!.!!.!!.!!!!!!!!......!!!!!!!.!.!!.!!!!!!!!!!!!!!!!..!!!!!!!!!..!!!!!!!!!!!!!.!!!.!!!!!!!!!!.!!!!!!!!..!.!!!!!!!!!..!.!!!!!!!![Fatal Error] :89:3725: The element type "a" must be terminated by the matching end-tag "</a>".
Exception in thread "main" java.lang.RuntimeException: nu.xom.ParsingException: The element type "a" must be terminated by the matching end-tag "</a>". at line 89, column 3725
at org.contentmine.eucl.xml.XMLUtil.parseXML(XMLUtil.java:395)
at org.contentmine.ami.lookups.WikipediaLookup.getHtmlBodyFromUrl(WikipediaLookup.java:391)
at org.contentmine.ami.lookups.WikipediaLookup.getWikidataHtmlBody(WikipediaLookup.java:379)
at org.contentmine.ami.lookups.WikipediaLookup.queryWikidata(WikipediaLookup.java:416)
at org.contentmine.ami.tools.AMIDictionaryTool.addWikiLinks(AMIDictionaryTool.java:755)
at org.contentmine.ami.tools.AMIDictionaryTool.createDictionaryListInRandomOrder(AMIDictionaryTool.java:736)
at org.contentmine.ami.tools.AMIDictionaryTool.addEntriesToDictionaryElement(AMIDictionaryTool.java:717)
at org.contentmine.ami.tools.AMIDictionaryTool.writeNamesAndLinks(AMIDictionaryTool.java:685)
at org.contentmine.ami.tools.AMIDictionaryTool.createDictionary(AMIDictionaryTool.java:524)
at org.contentmine.ami.tools.AMIDictionaryTool.runDictionary(AMIDictionaryTool.java:409)
at org.contentmine.ami.tools.AMIDictionaryTool.runSpecifics(AMIDictionaryTool.java:398)
at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:218)
at org.contentmine.ami.tools.AMIDictionaryTool.main(AMIDictionaryTool.java:362)
Caused by: nu.xom.ParsingException: The element type "a" must be terminated by the matching end-tag "</a>". at line 89, column 3725
at nu.xom.Builder.build(Unknown Source)
at nu.xom.Builder.build(Unknown Source)
at org.contentmine.eucl.xml.XMLUtil.parseXML(XMLUtil.java:392)
... 12 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 89; columnNumber: 3725; The element type "a" must be terminated by the matching end-tag "</a>".
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
... 15 more
runWDID.sh: line 2: $'\r': command not found
runWDID.sh: line 3: $'\r': command not found
runWDID.sh: line 4: $'\r': command not found
Thanks,
Difficult to work out what the problem is without more details.
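One detail in the log can be read independently of the XML exception: `$'\r': command not found` is what a shell typically prints when a script was saved with Windows (CRLF) line endings. Assuming that is the case for runWDID.sh, a minimal sketch of the conversion is below; the function name is hypothetical.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF (Windows) line endings to LF so a POSIX shell can run the script."""
    return data.replace(b"\r\n", b"\n")


# Example: a three-line script with CRLF endings, including one blank line.
fixed = to_unix_newlines(b"#!/bin/sh\r\n\r\necho ok\r\n")
```

This would only silence the `\r` messages; the parsing exception above them has a separate cause.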
Sir, I am getting WIKIDATA ID for compound names using ami-dictionary.
For batch run, a script runs ami-dictionary for 2114 compound names as search terms inputs.
I split the script into 21 small chunks, each containing 100 compound names.
Each split run is generating output, except one. The example script is for separate batch runs 1 to 21 (each batch is for 100 compound names). The 6th batch run is producing an error; the rest are running well.
On Sun, Aug 25, 2019 at 12:40 PM Ambarish Kumar notifications@github.com wrote:
Sir, I am getting WIKIDATA ID for compound names using ami-dictionary.
As a batch run, a script runs ami-dictionary for 2114 compound names as search terms inputs.
I splitted the script into 21 small chunks. Each containing 100 compound names.
Good design
Each splitted run is generating output, except one. Example script https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/runWDIDbatch1-21.sh is for first 1 to 21 batch runs ( each batch is for 100 compound names). 6th batch run is producing error. And rest are running well.
So the classic way is a (binary) chop. Split the 6th group into (say) 10 groups (6.0 ... 6.9) Run them all. One will fail (say 6.3) then split 6.3 into 6.3.0 ... 6.3.9 and we'll know where the problem is.
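The chop described above can be automated. The sketch below assumes exactly one name causes the failure and that a batch fails whenever it contains that name; `find_failing` and the stand-in `batch_fails` callback are hypothetical, and in practice the callback would run ami-dictionary on the sub-batch.

```python
def find_failing(items, batch_fails, parts=2):
    """Narrow a failing batch down to a single item by repeated splitting."""
    candidates = list(items)
    if not candidates:
        raise ValueError("empty batch")
    while len(candidates) > 1:
        size = -(-len(candidates) // parts)  # ceiling division
        for start in range(0, len(candidates), size):
            group = candidates[start:start + size]
            if batch_fails(group):
                candidates = group           # keep chopping inside the failing group
                break
        else:
            raise ValueError("no sub-batch failed; the failure may need several items together")
    return candidates[0]


# Stand-in for a real run: the batch "fails" if it contains the problem name.
batch = [f"compound{i}" for i in range(99)] + ["4-methylacetophenone"]
culprit = find_failing(batch, lambda group: "4-methylacetophenone" in group, parts=10)
```

With `parts=10` this mirrors the 6.0 ... 6.9 scheme; `parts=2` gives a strict binary chop.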
Sir,
Run error is due to 4-methylacetophenone.
Compiled all batch run results to get WIKIDATA ID.
Renaming resolveCompFreqCount.tsv to resolveCompFreqCountInChIWD.tsv.
Column description for the sheet - resolveCompFreqCountInChIWD.tsv is as follows.
The name and description columns are left only to check for disambiguation. As WIKIDATA ID are disambiguated, these will be dropped.
Listed disambiguations into WIKIDATA ID search are as follows.
C6956 ni
C6282 sa2 cell line
C6283 sa3 Wikimedia disambiguation page
C6395 sh1 Wikipedia disambiguation page
C6396 sh2 InterPro Domain
C6397 sh3 InterPro Domain
C6398 sh4 Wikipedia disambiguation page
1389 WIKIDATA ID.
395 query string for WIKIPEDIA link.
On Mon, Aug 26, 2019 at 9:18 AM Ambarish Kumar notifications@github.com wrote:
Sir,
- Run error is due to 4-methylacetophenone.
Well done to identify this. I'll have a browse to see what the problem is. Maybe it has some blank fields
- Compiled all batch run results to get WIKIDATA ID.
- Renaming resolveCompFreqCountInChI.tsv to resolveCompFreqCountInChIWD.tsv.
Column description for the sheet - resolveCompFreqCountInChIWD.tsv https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompFreqCountInChIWD.tsv is as follows.
Compound_identifier - unique compound identifier.
original_name - original compound name present into the EssoilDB1.0.
clean_name - cleaned name of compounds.
cid - compound cid (retrieved from PubChem).
Freq - frequency count of each compound into essential oil profile of plants reported into the Essoildb1.0.
InChI - InChIs of compound (retrieved from PubChem)
wikidata - WIKIDATA identifier.
name - WIKIDATA lookup name for each compound.
description - description of each compound for wikidata lookup.
wikipedia - WIKIPEDIA query string.
The name and description columns are kept only to check for disambiguation. Once the WIKIDATA IDs are disambiguated, these columns will be dropped.
Disambiguation entries listed in the WIKIDATA ID search are as follows.
C6956 ni
I don't know what this is. I am sure it's not the element Ni. It's probably something like "No Information". OMIT
C6282 sa2 cell line
C6283 sa3 Wikimedia disambiguation page
C6395 sh1 Wikipedia disambiguation page
C6396 sh2 InterPro Domain
C6397 sh3 InterPro Domain
C6398 sh4 Wikipedia disambiguation page
These are NOT compounds, so OMIT
1389 WIKIDATA ID.
395 query string for WIKIPEDIA link.
Thanks. Can you also generate the short InChI? I'll explain. ARE YOU FREE FOR A SKYPE?
Yes sir. I am on Hangouts right now.
2 minutes...
The problems you are having are due to the lookup process in Wikidata. When you look for C758 by its NAME (-)-alloaromadendrene you find all objects with alloaromadendrene in. That includes articles! Like "Essential Oil Alloaromadendrene from Mixed-Type Cinnamomum osmophloeum Leaves Prolongs the Lifespan in Caenorhabditis elegans".
So we have to search for the compound itself using the PubChem ID. I think the best thing is to delete the H and I (name and description) columns and rerun with SPARQL for PubChemID.
P.
OK sir. I am going through the SPARQL queries.
PubChem is slightly messy - I have been tweeting. If there is a Wikidata entry (e.g. Camphor) it links to the EN language Wikipedia. If there's no WP but there is a Wikidata it links to that but it's still labelled Wikipedia (wrongly).
So the safest approach is:
name ==pubchem=> Pubchem CID
PubchemID ==pubchem=> Wikipedia link (page)
CID ==wikidataSparql=> wikidata ID
P.
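The first arrow of the chain above (name to PubChem CID) could be sketched as follows, assuming the PubChem PUG REST name lookup is used. The helper names `cid_lookup_url` and `parse_cids` are illustrative, and the code only builds the request URL and parses a reply; it does not contact PubChem. The name is sent unchanged (percent-encoded only), so no chemistry is translated or truncated.

```python
import json
from typing import List
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(name: str) -> str:
    """URL for the PUG REST name -> CID lookup, with the name percent-encoded."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def parse_cids(payload: str) -> List[int]:
    """Extract CIDs from a PUG REST JSON reply; empty list if none were found."""
    data = json.loads(payload)
    return list(data.get("IdentifierList", {}).get("CID", []))


# The URL can be fetched with any HTTP client; the reply is passed to parse_cids.
print(cid_lookup_url("(E)-2-undecenol"))
```

A name that returns an empty list should be recorded as not found rather than shortened and retried.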
Sir,
Please check the updated table - resolveCompTable20190827.tsv containing cleaned compound names, compound cid, InChIs, InChIKey and WIKIDATA id.
A short briefing to generate the WIKIDATA id based on compound cid using SPARQL query is as follows.
- Step 1
SELECT DISTINCT ?compound ?compoundLabel ?cid WHERE {
  ?compound wdt:P662 ?cid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
- Step 2
Download all retrieved records of `?compounds` - as WIKIDATA entry and their `?cid` - PubChem compound CIDs.
- Step 3
Map the WIKIDATA id to the EssoilDB compound table using compound CIDs.
- RScript to map records of SPARQL query results to EssoilDB compound tables.
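# Read the EssoilDB compound table and the downloaded SPARQL results.
compTable <- read.csv("E:/resolveCompTable.csv")
sparqlResults <- read.csv("E:/sparqlResults.csv")
# Left join on cid: keep every EssoilDB compound, with NA where no WIKIDATA id matched.
resolveCompTable20190827 <- merge(compTable, sparqlResults, by = "cid", all.x = TRUE)
write.csv(resolveCompTable20190827, "E:/resolveCompTable20190827.csv")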
The PubChem CID is WIKIDATA property `P662`. The SPARQL query retrieves all compounds that have a CID. In the query, `?compound` is the subject, `wdt:P662` is the predicate and `?cid` is the object.
Additionally, one may retrieve ChEBI identifiers, KEGG identifiers, ChemSpider identifiers, compound formula, InChIKey, CAS number and so on (where available in WIKIDATA) with the following SPARQL query.
SELECT DISTINCT ?compound ?compoundName ?cas ?formula ?compoundLabel ?inchikey ?chemspider ?pubchem ?chebi ?KEGG_id WHERE {
  ?compound wdt:P662 ?cid .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P235 ?inchikey . }
  OPTIONAL { ?compound wdt:P661 ?chemspider . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  OPTIONAL { ?compound wdt:P683 ?chebi . }
  OPTIONAL { ?compound wdt:P665 ?KEGG_id . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
InChIKey is added from the PubChem Identifier Exchange Service using the PubChem compound CID.
Count of retrieved WIKIDATA identifiers for EssoilDB compounds is `1317`.
[SPARQL query editor](https://query.wikidata.org/)
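The queries above fetch every Wikidata item that has a PubChem CID and filter afterwards. A possible alternative, sketched here as an assumption rather than as the method actually used, is to restrict the query to the CIDs already in the compound table with a SPARQL `VALUES` clause. The function name `build_cid_query` is illustrative; the example CIDs 8892 and 8186 are taken from earlier in this thread.

```python
from typing import Iterable


def build_cid_query(cids: Iterable[int]) -> str:
    """SPARQL query returning Wikidata items whose PubChem CID (P662) is in `cids`."""
    # P662 values are stored as string literals, so each CID is quoted.
    values = " ".join(f'"{int(cid)}"' for cid in cids)
    return (
        "SELECT ?compound ?compoundLabel ?cid WHERE {\n"
        f"  VALUES ?cid {{ {values} }}\n"
        "  ?compound wdt:P662 ?cid .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


# Paste the printed query into the SPARQL query editor, or send it in batches.
print(build_cid_query([8892, 8186]))
```

For about 2.1K compounds the CID list would probably need to be sent in several batches; the batch size is a tuning choice and is not specified anywhere in this thread.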
On Tue, Aug 27, 2019 at 7:55 AM Ambarish Kumar notifications@github.com wrote:
Sir, Please check for the updated table - resolveCompTable20190827.tsv https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/resolveCompTable20190827.tsv containing cleaned compound names, compound cid, InChIs, InChIKey and WIKIDATA id.
Thanks, looks good. Manny and I will need to go through the chemistry and remove false positives (e.g. SA3).
A short briefing to generate the WIKIDATA id based on compound cid is as follows.
I can't see this. Probably add this as a comment in the Issue
Sir, please go through the edited previous issue. All steps are mentioned in it.
Thanks - my bad, I was reading the email, not the Issues.
Sir, chemical structure diagrams can be fetched using the PubChem download services, or alternatively using a SPARQL query.
Maybe we should download 2.1K images into the compound directory. PubChem is more standardised than Wikidata, but some of the diagrams are awful. Then we can link directly without having to go through the WWW.
Comment: Wikidata does not have entries for over 100 compounds. I'll find out how to add them from PubChem.
OK sir.
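A minimal sketch of preparing that bulk download, assuming the PUG REST PNG endpoint is used and that images are stored as `<cid>.png` in a directory named `compound` (both the file naming and the helper name `image_targets` are assumptions). The code only pairs URLs with local paths; the actual fetch is left out.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def image_targets(cids: Iterable[int],
                  out_dir: str = "compound") -> List[Tuple[str, Path]]:
    """Pair each CID's PUG REST PNG URL with the local file it would be saved to."""
    return [
        (f"{PUG_REST}/compound/cid/{int(cid)}/PNG", Path(out_dir) / f"{int(cid)}.png")
        for cid in cids
    ]


for url, path in image_targets([8892, 8186]):
    print(url, "->", path)
```

Each pair can then be fetched with any HTTP client, with a pause between requests so that 2.1K downloads stay within PubChem's usage limits.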
Chemical nomenclature is complex and ambiguous. Any attempt to disambiguate MUST record ambiguity. Thus acetyl-furan could be 1-acetyl-furan or 2-acetyl-furan. OPSIN (https://opsin.ch.cam.ac.uk) gives:
and this must be recorded
Always test with OPSIN.
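One way to make sure ambiguity is recorded rather than silently resolved is to store every candidate alongside the original name. The sketch below is an assumption about how such a record might look; the class name, the field names and the identifier `C0000` are hypothetical, and the candidates would come from a tool such as OPSIN or from a chemist.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NameResolution:
    """One name-to-structure attempt, keeping every candidate that was considered."""
    compound_id: str
    original_name: str
    candidates: List[str] = field(default_factory=list)
    note: str = ""

    @property
    def status(self) -> str:
        """'unresolved' with no candidates, 'resolved' with one, else 'ambiguous'."""
        if not self.candidates:
            return "unresolved"
        return "resolved" if len(self.candidates) == 1 else "ambiguous"


# C0000 is a placeholder identifier, not a real EssoilDB entry.
record = NameResolution(
    compound_id="C0000",
    original_name="acetyl-furan",
    candidates=["1-acetyl-furan", "2-acetyl-furan"],
    note="position of the acetyl group not given in the source name",
)
print(record.status)
```

Keeping the status derived from the candidate list means a name cannot be marked resolved while more than one structure is still on record.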