Open petermr opened 5 years ago
== create sample disambiguation of chemistry ==
For each lookup, go to the site and look up the name. Record the ID if found; otherwise leave it empty. If there are special comments, record them.
This may be automatable through Egon's tools.
The tools to use are:
PubChem https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi This reads a file of chemical names (here "Synonyms") and translates them to InChIs. This is good for trivial names and common systematic names but fails on novel systematic names. Outputs are (a) PubChem IDs (CIDs) and (b) InChIs.
OPSIN https://bitbucket.org/dan2097/opsin/ This is good for systematic names but fails on unusual trivial names. It reads a list of names and converts them to InChIs. OPSIN is best used as a downloaded Java JAR (https://bitbucket.org/dan2097/opsin/downloads/) and run by:
java -jar [jarfile.jar] -o inchi namesin.txt inchiout.txt
By using InChIs we have a correspondence between the systems.
INPUTS From the CSV file output column 2 (common names). Edit out quotes (") and delete spaces around " - "; split esters, e.g. "bornylacetate" => "bornyl acetate".
OUTPUTS If PubChem has an ambiguous compound it outputs stereoisomers. These may need manual editing to give the commonest.
A typical example: https://pubchem.ncbi.nlm.nih.gov/compound/42608158 shows the most likely isomer for alloaromadendrene (Allo-Aromadendrene).
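The INPUTS edits could be sketched in Python roughly as follows. This is a minimal sketch, not the procedure actually used: the function name `clean_name` and the suffix list `ESTER_SUFFIXES` are hypothetical, and the ester split only fires when the suffix is glued to a preceding "yl". It returns a new string, so the raw name can be kept alongside it.

```python
import re

# Hypothetical suffix list (an assumption); extend it only after checking real data.
ESTER_SUFFIXES = ("acetate", "butyrate", "formate")


def clean_name(raw: str) -> str:
    """Return a cleaned copy of a raw name; the raw value itself is not modified."""
    # Edit out double quotes.
    name = raw.replace('"', "")
    # Delete spaces around " - ", e.g. "alpha - pinene" -> "alpha-pinene".
    name = re.sub(r"\s+-\s+", "-", name).strip()
    # Split fused esters, e.g. "bornylacetate" -> "bornyl acetate".
    # Only split when the suffix directly follows "yl", to avoid breaking
    # names such as "isobutyrate".
    for suffix in ESTER_SUFFIXES:
        name = re.sub(rf"(?<=yl){suffix}\b", f" {suffix}", name)
    return name


if __name__ == "__main__":
    for raw in ['"bornylacetate"', "alpha - pinene", "hexyl isobutyrate"]:
        print(repr(raw), "->", repr(clean_name(raw)))
```

Any name changed by a rule like this should still be confirmed by a lookup before it is treated as correct.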
Vinita should supervise the processing, which will be largely carried out by Ambarish and later Shruthi.
It is particularly important to check correctness of results.
Method: Divide the work into small batches (PubChem may mandate this, but it's good practice). At this stage, no more than 100 compounds per batch.
0/ There should be a single communal table (as described). There may need to be more columns than specified there.
1/ Run the batch against PubChem to get (a) CIDs and (b) InChIs. Add comments (c) where PubChem has failed or is ambiguous.
2/ Run the batch on OPSIN to get (d) InChIs and (e) comments.
3/ Search Wikidata with (a) CID, (b) InChI, (c) original name if those fail. This should be done automatically. For unambiguous compounds this will give a link to Wikidata that should be included in the EssOilDB database.
The correctness of the search will be shown by matching InChIs for numerous compounds. We will report early results in the poster.
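Splitting the name list into batches of at most 100 might look like the sketch below. The helper name `batches` is an assumption, and the limit of 100 comes from the method above rather than from any service documentation.

```python
from typing import Iterator, List


def batches(names: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `names`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


if __name__ == "__main__":
    demo = [f"compound-{i}" for i in range(250)]
    for number, batch in enumerate(batches(demo)):
        # Batch-0 holds the first 100 names, Batch-1 the next 100, and so on.
        print(f"Batch-{number}: {len(batch)} names")
```

Keeping the batch number with each output file makes it easier to document how a file was produced.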
Sir, Please go through the Batch-0 run for the first 100 compounds. compNameDisambiguation.csv (https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/compNameDisambiguation.csv) is the output file with the EssOilDB entry, PubChem lookups, OPSIN lookups and comments.
The Wikidata entry is still pending.
Sir, Please go through the files. 100cnamePubchemAndOPSIN.csv 100cnamePubChem.csv 100cnameOPSIN.csv
The PubChem lookup generates isomers. These are kept in the file as generated (the PubChem lookup entries are also in the same order as the generated output).
Files are meaningless unless they have documentation. Please briefly record (on Github) how these files were created.
Also I will probably move these files in the directory structure
On Mon, Jul 15, 2019 at 12:27 PM Ambarish Kumar notifications@github.com wrote:
Sir, Please go through the files. 100cnamePubchemAndOPSIN.csv https://github.com/gilienv/EssOilDB/blob/master/100cnamePubchemAndOPSIN.csv 100cnamePubchem.csv https://github.com/gilienv/EssOilDB/blob/master/100cnamePubchem.csv 100cnameOPSIN.csv [https://github.com/gilienv/EssOilDB/blob/master/100cnameOPSIN.csv]
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gilienv/EssOilDB/issues/76?email_source=notifications&email_token=AAFTCSZK47ITAATNHSC53J3P7RNK5A5CNFSM4ICLYMF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ5M7SI#issuecomment-511365065, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFTCS53XUTSVE6D6DREN2DP7RNK5ANCNFSM4ICLYMFQ .
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
I had better luck fixing chemical names with this: https://www.ncbi.nlm.nih.gov/pcsubstance/?term=%22(Z)-BETA-OCIMENE%22
Not so much luck with this: https://opsin.ch.cam.ac.uk/
This one is pretty good too: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:10447
Thank you Manny!
Ambarish - Requesting you to look up Manny's suggestions above and check how we fare in terms of Chemical Disambiguation
As PMR had first pointed out - we need to document the KINDS of errors we have in the Chemistry.
At present, the most comprehensive assessment of types of errors has been conducted by Manny, and we have had a few meetings to discuss various issues.
More on my dropbox, but happy to add here if Ambarish initiates a list of Error types, along with V.1 entries for each kind
We are clearly going to have to do manual correction of chemical names. Common problems include:
To be correct we should have at least 2 columns (raw data, curated data)
We generated 1000 compound records using OPSIN and PubChem. To get the Wikidata lookup column, we will have to redo the run. I am preparing a run to get the Wikidata and Wikipedia lookups.
Ambarish has made good progress on disambiguation - see
https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChem.tsv
This has lookups on trivial names (coname). (Trivial means "commonly used", not algorithmically parsable.)
(@mannyrules note that OPSIN is created for systematic names and has a limited number of trivial names. By contrast PubChem and ChEBI have a lot of trivial names but cannot parse systematic names that aren't in their databases. So OPSIN+PubChem/ChEBI should catch most.)
@ambarishK and I had a good discussion today. The result in OPSINPubChem is:
coname table available so we can check.
coname. These are essential to make sure we keep the correspondence between the PubChem lookup and OPSIN. Following Wikidata and PubChem these will be sequential (C1234 etc.)
(2,4)-nonadienal .. ''(2,4)-nonadienal'' is unparsable due to the following being uninterpretable: ''(2,4)-nonadienal''
An OPSIN-parsable name is 2,4-nonadienal
ACTION we will need (at least) three columns:
- original name. This is critical, in case it is actually correct in hindsight.
- cleaned name. This is a sister column, with the corrected name. In some cases this might be edited more than once. I suggest that we enter in this column only when at least one service can positively look it up.
- name comments. Brief account of who cleaned the name and why (e.g. removeBrackets petermr 20190720). We should try to create keywords, e.g. removeBrackets.
The benefit is that @mannyrules and other volunteers (@petermr) can edit this on a day-by-day basis without affecting the rest of the submission.
Both PubChem and OPSIN produce InChIs if successful. We should find out as soon as possible when InChIs don't agree, as this will probably be an important new problem.
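One way to surface disagreements early is to classify each row by comparing the two InChI strings. This is a hedged sketch: the status labels (`AGREE`, `DISAGREE`, ...) and the function names are hypothetical, and exact string equality is a strict test that will also flag pairs differing only in stereo layers.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def compare_inchis(pubchem: Optional[str], opsin: Optional[str]) -> str:
    """Classify one row by which services returned an InChI and whether they match."""
    p = (pubchem or "").strip()
    o = (opsin or "").strip()
    if p and o:
        return "AGREE" if p == o else "DISAGREE"
    if p:
        return "PUBCHEM_ONLY"
    if o:
        return "OPSIN_ONLY"
    return "NEITHER"


def summarize(rows: Iterable[Tuple[Optional[str], Optional[str]]]) -> Counter:
    """Count how many rows fall into each status."""
    return Counter(compare_inchis(p, o) for p, o in rows)


if __name__ == "__main__":
    rows = [
        ("InChI=1S/CH4/h1H4", "InChI=1S/CH4/h1H4"),  # both agree
        ("InChI=1S/CH4/h1H4", ""),                    # OPSIN failed
        (None, None),                                 # neither succeeded
    ]
    print(summarize(rows))
```

Rows marked `DISAGREE` are the ones worth inspecting by hand first.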
Created a new table EssOilDBOPSINPubChemInChI.csv with some columns removed and sorted. This is just for more rapid comparison of InChIs. Ignore it.
I have been, and will continue, relatively quickly replacing errors in punctuation as well as “foreign” characters (e.g. Ã, ã). I have also created a little table for myself where I am storing other, stranger anomalies such as things that look like spaces, but are actually some indescribable character.
Each time I find one, I save it so I can go through all of them “one last time” after the last person has touched the data.
I don’t know the cause of this strange data. It could be that we are each using different keyboard language settings, operating systems, or different dictionaries as default in our spreadsheet programs.
No matter though. I’m confident I can clean that stuff up.
My biggest limitation is not knowing what’s actually correct or incorrect. But on the other hand, my layman’s eyes see things others may miss, so together we’ll ferret out the weirdness.
A very quick eyeball of InChIs
Of the 1000 names, approximately 700 were translated by PubChem and 400 by OPSIN (though there is still a punctuation problem and this number should increase).
There are 300 which have InChIs from both and I have only spotted 3-4 which are grossly different, mainly because OPSIN doesn't have the right systematic names (e.g. terpinen-4-ol is a derivative of terpinene, but OPSIN doesn't have this trivial name and translates it as ter[pinene]-4-ol - 3 pinenes stitched together). But generally OPSIN agrees with PubChem ca 99%, which is great. @vinitamehlawat we can report this figure.
Brilliant, The main thing is to record everything and try to systematize the errors. For example: EXTRA_SPACE MISSING_SPACE INVISIBLE_CHAR
Then we can analyze what is most frequent.
I agree with you that there may be an invisible character problem. This might come from non-Unicode characters that cannot be rendered. Believe me, I know most of the "tricks"
We should only use ASCII characters (32-126). No clever spaces (non-breaking space, zero-width space, etc.). No Greek characters (=> beta, etc.). No em-dashes (only hyphen-minus), no umlauts and other diacritics. Quoting is a real problem and in general no quotes or apostrophes.
I don't think we can "correct" any of this algorithmically and if we do I suggest that I do it.
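Detection, as opposed to correction, is easy to automate. The sketch below only reports characters outside ASCII 32-126 and changes nothing; the function name `find_bad_chars` is an assumption.

```python
import unicodedata
from typing import List, Tuple


def find_bad_chars(name: str) -> List[Tuple[int, str, str]]:
    """Report (position, code point, Unicode name) for characters outside ASCII 32-126."""
    report = []
    for pos, ch in enumerate(name):
        if not 32 <= ord(ch) <= 126:
            # Control characters have no Unicode name, so supply a default label.
            label = unicodedata.name(ch, "UNNAMED")
            report.append((pos, f"U+{ord(ch):04X}", label))
    return report


if __name__ == "__main__":
    samples = ["alpha\u00a0pinene", "\u03b2-pinene", "beta-pinene"]
    for sample in samples:
        print(repr(sample), find_bad_chars(sample))
```

The report can go straight into the comments column, so a person decides what the replacement should be.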
P.
I will start adding records after meeting today. Also, I will draft all possibilities of name inconsistencies with example.
Sir
I prepared a fresh sheet for name cleaning.
It contains exactly 7162 unique compound records.
The sheet we discussed today is there as it is: https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/EssOilDBOPSINPubChemInChIs_A.csv
It has 7169 unique compound entries.
It is better to continue with the sheet discussed today.
I tried to account for the difference of 7 records. It may be due to 7 repeated compound names.
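A count mismatch like this can be checked directly by listing the names that occur more than once. This is a minimal sketch assuming exact string comparison; the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def repeated_names(names: List[str]) -> Dict[str, int]:
    """Return each name that appears more than once, with its number of occurrences."""
    counts = Counter(names)
    return {name: count for name, count in counts.items() if count > 1}


def surplus_rows(names: List[str]) -> int:
    """Number of rows that would disappear if exact duplicates were collapsed."""
    return len(names) - len(set(names))


if __name__ == "__main__":
    column = ["limonene", "linalool", "limonene", "camphor", "limonene"]
    print(repeated_names(column))
    print(surplus_rows(column))
```

If `surplus_rows` equals the difference between the two sheets, repeated names fully explain it; if not, something else changed.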
Documentation for generating sheet is at
On Fri, Jul 19, 2019 at 10:04 AM Ambarish Kumar notifications@github.com wrote:
Table for name correction https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChemInChIs_A.csv
We cannot create additional tables until we have agreed the identifiers.
Sir
Please check the sheet EssOilDBOPSINPubChemInChIsANewFinal.csv.
It contains exactly the same identifiers as the first sheet (the finalised one), EssOilDBOPSINPubChemInChIs_A.csv.
I am removing the sheets EssOilDBOPSINPubChemInChIsANew.csv and EssOilDBOPSINPubChemInChIsANew.tsv.
We are clearly going to have to do manual correction of chemical names. Common problems include:
- misspelling e.g - 1,8-cineol
- spaces included "alpha - pinene" e.g - 1,2,3,4-Tetrahydro-1,5,7-trimethyl naphthalene
- spaces omitted "ethylacetate" e.g - (e)-sesquilavandulylacetate
- hyphens omitted/included e.g - 1,8 cineole
- quotes (strange, unbalanced...) e.g - (2,4)-nonadienal
- multiple locants
EssOilDB entry is "bergamotol acetate" but PubChem search shows - Trans-.alpha.-Bergamatol Acetate OR (Z)-.Alpha.-Bergamotol Acetate OR Cis-alpha-Bergamotol Acetate.
- missing locants
e.g - borneole
e.g - 1,4-cadinadienea
e.g - humulene epoxide iii. It should have been humulene epoxide III.
e.g - EssOilDB entry is hexadecanoic0acid. It should have been Hexadecanoic acid.
e.g - EssOilDB entry is (2e)-octen-1-ol. It should have been E-2-octen-1-ol.
To be correct we should have at least 2 columns (raw data, curated data)
Thanks, Yes, This is a difficult area and we are going to have to treat it carefully and systematically. It is essential to preserve the original spelling regardless of whether it is "wrong" or "right". So we must have a column for raw name. There are names which are very similar but represent different compounds. If we "correct" these we will corrupt the database. Thus:
decanol decanal decenol decenal
are all valid names and are all distinct. (if the original abstracter made a copying error it may be difficult to detect)
On Mon, Jul 22, 2019 at 8:56 AM Ambarish Kumar notifications@github.com wrote:
We are clearly going to have to do manual correction of chemical names. Common problems include:
- misspelling e.g - 1,8-cineol
This is not a misspelling, it's a synonym. See https://pubchem.ncbi.nlm.nih.gov/compound/Eucalyptol which lists (section 2.4 Synonyms, 2.4.1 MeSH Entry Terms):
1,8 Cineol
1,8 Cineole
1,8 Epoxy p menthane
1,8-cineol
1,8-cineole
1,8-Epoxy-p-menthane
cineole
eucalyptol
Soledum
- spaces included "alpha - pinene" e.g - 1,2,3,4-Tetrahydro-1,5,7-trimethyl naphthalene
Yes!
- spaces omitted "ethylacetate" e.g - (e)-sesquilavandulylacetate
Yes
- hyphens omitted/included e.g - 1,8 cineole
Yes
- quotes (strange, unbalanced...) e.g - (2,4)-nonadienal
Yes
multiple locants
missing locants
We should create short unique codes for this:
examples SYNONYM ADDED_SPACE MISSING_SPACE MISSING_HYPHEN ADDED_HYPHEN QUOTE_ERROR MULTIPLE_LOCANT MISSING_LOCANT
By using codes like this (always uppercase) we can normalize the reporting of errors.
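The codes could be pinned down as a fixed vocabulary so that only agreed spellings get into the comments column. This sketch assumes an `Enum` named `NameIssue` and a hypothetical helper that suggests a space or hyphen code by comparing the raw and cleaned names. It is a heuristic for the simple cases only, and anything else is left for a person to code.

```python
from enum import Enum
from typing import Optional


class NameIssue(Enum):
    """The uppercase codes proposed in the thread."""
    SYNONYM = "SYNONYM"
    ADDED_SPACE = "ADDED_SPACE"
    MISSING_SPACE = "MISSING_SPACE"
    MISSING_HYPHEN = "MISSING_HYPHEN"
    ADDED_HYPHEN = "ADDED_HYPHEN"
    QUOTE_ERROR = "QUOTE_ERROR"
    MULTIPLE_LOCANT = "MULTIPLE_LOCANT"
    MISSING_LOCANT = "MISSING_LOCANT"


def _added_or_missing(raw: str, cleaned: str, char: str,
                      added: NameIssue, missing: NameIssue) -> Optional[NameIssue]:
    # The names must be identical once every `char` is removed from both.
    if raw.replace(char, "") != cleaned.replace(char, ""):
        return None
    diff = raw.count(char) - cleaned.count(char)
    if diff > 0:
        return added
    if diff < 0:
        return missing
    return None


def suggest_issue(raw: str, cleaned: str) -> Optional[NameIssue]:
    """Suggest a space or hyphen code, or None when the difference is something else."""
    if raw == cleaned:
        return None
    return (_added_or_missing(raw, cleaned, " ", NameIssue.ADDED_SPACE, NameIssue.MISSING_SPACE)
            or _added_or_missing(raw, cleaned, "-", NameIssue.ADDED_HYPHEN, NameIssue.MISSING_HYPHEN))


if __name__ == "__main__":
    print(suggest_issue("ethylacetate", "ethyl acetate"))
    print(suggest_issue("alpha - pinene", "alpha-pinene"))
```

`NameIssue("MISSING_SPACE")` raises `ValueError` for an unknown code, which is a cheap way to validate a comments column.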
To be correct we should have at least 2 columns (raw data, curated data)
Please go through the name cleaning sheet updated by me - Copy of EssOilDBOPSINPubChemInChIs_A.csv.
There is an additional column for the IUPAC name. I have added the first 50 records to it.
I added a short description of today's file to the documentation page.
Sir,
Compound_identifiers are now C1, C2, C3, ..., which correspond to the previous identifiers 1C, 2C, 3C, ... respectively. Updated sheet with compound_identifier.
I will revisit this after I have created the poster. We have to start again and document exactly what we start with and what operations we carry out. The confusing thing was the IUPAC names, which were not in the original V1.0 (as far as I know). In fact there is only one compound name and possibly a CAS number.
But I have to talk with Gita first.
On Tue, Jul 23, 2019 at 1:08 PM Ambarish Kumar notifications@github.com wrote:
Sir, Please go through the documentation page https://github.com/gilienv/EssOilDB/blob/master/chemistry/Disambiguating_chemistry_and_fixing_typos.md. I added a short description of today file.
Compound_identifiers are now as C1,C2,C3 ......which corresponds to 1C, 2C, 3C ...... respectively. Updated sheet with compound_identifier https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChemInChIs_A.csv .
Sir, I have listed WIKIDATA 'Q' ID for all compounds onto the poster. Please go through the page
Thank you. I will look
On Wed, Jul 24, 2019 at 1:11 PM Ambarish Kumar notifications@github.com wrote:
Sir, I have listed WIKIDATA 'Q' ID for all compounds onto the poster. Please go through the page https://github.com/gilienv/EssOilDB/blob/master/EssOilDBPosterWIKIDATA-QID.md
Dear Sir
One pic I found in my mobile camera roll, It is of harvesting time ( of this March when I had visited my home ). Pic has Lantana camara shrubs spread at the bottom. If convenient, it can be included into the poster.
Thanks Ambarish, Nice offer, but I had to send the poster off today.
On Thu, Jul 25, 2019 at 7:08 AM Ambarish Kumar notifications@github.com wrote:
Dear Sir
One pic https://github.com/gilienv/EssOilDB/blob/master/assets/IMG_9575.JPG I found in my mobile camera roll, It is of harvesting time ( of this March when I had visited my home ). Pic has Lantana camara shrubs spread at the bottom. If convenient, it can be included into the poster.
Sir, I am adding cleaned names to the sheet.
Column description is as follows.
- Compound_identifier - Unique ID assigned to each compound.
- Original_name - Original name of the compound as given in the database.
- cleaned_name - Cleaned name of the compound.
- name_comments - Comments on the name cleaning.
Cleaned names are obtained from PubChem (compound name lookup).
I am adding cleaned_name starting from the first entries.
A few names are not retrieved. Ex -
ID Original_name
C4. (e)-2,(z)-6-decadienal
C5. (e)-2,2-decenal
C7. (e)-2-decanal
C13. (e)-2-hexyl butyrate
C20. (e)-anethole+bornyl acetate
Please suggest any corrections or changes. Also, we will have to check whether every cleaned_name generates an InChIKey or not.
Thank you, will have a look. This is a very important table. Do they all have E2.0 identifiers?
On Fri, Aug 16, 2019 at 8:58 AM Ambarish Kumar notifications@github.com wrote:
Sir, I am adding cleaned names to the sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/CopyEssOilDBOPSINPubChemInChIs_A_Sheet1.csv .
Yes sir.
Please do NOT use capital letters in chemical names, except for atoms and stereo identifiers
(Z)-Alpha-Bisabolene should be (Z)-alpha-bisabolene
In general PubChem or ChEBI will give the correct capitalization.
Common rules:
- E, Z, R, S capitalized: (E)-but-2-ene, (R,S)-tartaric acid
- o-, m-, p- lowercase: o-cresol, p-menthane
- N- capitalized: N-ethyl succinimide
But the safe way is to look this up.
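For a first pass before the lookup, part of the capitalization rule can be sketched in code. This is a deliberately narrow sketch with a hypothetical function name: it lowercases the whole name and restores capitals only for E/Z/R/S descriptors inside parentheses. It does not handle N-, unparenthesised descriptors such as E-2-octen-1-ol, Roman numerals such as III, or element symbols, so its output is a suggestion to verify against PubChem or ChEBI, not a final answer.

```python
import re

# Parenthesised groups made only of digits, commas and the letters e, z, r, s,
# e.g. "(e)", "(2e)", "(r,s)". The pattern is applied after lowercasing.
_STEREO_GROUP = re.compile(r"\(([0-9ezrs,]+)\)")


def suggest_case(name: str) -> str:
    """Lowercase a name, then re-capitalize parenthesised stereo descriptors."""
    lowered = name.lower()
    return _STEREO_GROUP.sub(lambda m: "(" + m.group(1).upper() + ")", lowered)


if __name__ == "__main__":
    for raw in ["(Z)-Alpha-Bisabolene", "(2e)-octen-1-ol", "(R,S)-Tartaric acid", "P-Menthane"]:
        print(raw, "->", suggest_case(raw))
```

Because the function returns a new string, the original name stays untouched in its own column.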
This table will be a very important resource for the future.
THERE IS A VERY VERY SERIOUS ERROR IN COLUMN 12 WHICH CORRUPTS THE TABLE COMPLETELY.
The column is called "2,4-nonadienal" which is a VALUE not a name.
THIS MEANS THAT EVERY VALUE IN THIS COLUMN POINTS TO THE WRONG COMPOUND (PROBABLY OFF-BY-ONE).
PLEASE FIND WHERE THE ERROR OCCURRED . DO NOT HAND EDIT THE TABLE. IF YOU GET THIS WRONG IT WILL DESTROY THE TABLE FOR EVER.
Correct the software that generates the table and re-generate it .
What is the actual name of this column?
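A header check run by the generating software before anything else touches the table would catch this kind of problem. The sketch below is an assumption-laden illustration: the expected column list reuses the four column names described earlier in this thread, and the real table has more columns whose names are not given here.

```python
import csv
import io
from typing import List

# Assumed column names (from the column description in this thread); the real
# table has more columns, so this list is illustrative only.
EXPECTED_COLUMNS = ["Compound_identifier", "Original_name", "cleaned_name", "name_comments"]


def check_table(csv_text: str, expected: List[str]) -> List[str]:
    """Return a list of problems: a wrong header, or rows with the wrong field count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return ["table is empty"]
    problems = []
    if header != expected:
        problems.append(f"unexpected header: {header}")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: {len(row)} fields, header has {len(header)}")
    return problems


if __name__ == "__main__":
    bad = 'Compound_identifier,Original_name,cleaned_name,"2,4-nonadienal"\nC1,a,a\n'
    for problem in check_table(bad, EXPECTED_COLUMNS):
        print(problem)
```

A header that contains a chemical name instead of a column name is reported immediately, and the table can then be regenerated rather than edited by hand.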
Sir,
I removed column 12. It was left over in the sheet; I had put it there to get the CIDs and InChIKeys.
Sir, I am making all the changes related to the case of letters used in the nomenclature.
For name cleaning I have to do a PubChem lookup for each compound name. If I change the case of letters or the notation used in the nomenclature (for the entire Original_name column in the Excel sheet), how will I verify the cleaned_name?
Until now I have done the PubChem compound name lookup manually for each compound. Please suggest how to proceed further.
Pubchem and ChEBI and Wikidata all have APIs for automatic lookup. You should never use manual ones for more than 10. All these systems have RESTful APIs. You construct a URL and then use curl or similar system to query the system. The results come back as JSON or XML depending on what is available. Some of them allow multiple queries in a batch.
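As an illustration of "construct a URL, fetch, parse JSON", here is a sketch against PubChem's PUG REST name lookup. The helper names are assumptions, the sample JSON in the demo is hand-written in the shape PUG REST returns for a property request (it is not a recorded response), and `fetch_properties` is shown but not called, so the example runs without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_name_url(name: str) -> str:
    """URL asking PUG REST for the InChI of every compound matching a name."""
    encoded = urllib.parse.quote(name, safe="")
    return f"{PUG_REST}/compound/name/{encoded}/property/InChI/JSON"


def parse_properties(json_text: str) -> List[Dict[str, object]]:
    """Extract the list of {CID, InChI} records from a PUG REST property response."""
    data = json.loads(json_text)
    return data.get("PropertyTable", {}).get("Properties", [])


def fetch_properties(name: str) -> List[Dict[str, object]]:
    """Perform the HTTP request (needs network; failed lookups raise urllib.error.HTTPError)."""
    with urllib.request.urlopen(build_name_url(name)) as response:
        return parse_properties(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(build_name_url("bornyl acetate"))
    sample = ('{"PropertyTable": {"Properties": [{"CID": 5281515, "InChI": '
              '"InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11'
              '/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m1/s1"}]}}')
    for record in parse_properties(sample):
        print(record["CID"], record["InChI"])
```

The same URL can be passed to curl; a name that matches several compounds comes back as several records, which is where the ambiguity has to be resolved.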
Sir, how should I select which CID to keep when multiple CIDs are generated for the same compound name?
For example:
trans-caryophyllene 5281522
trans-caryophyllene 5281515
isocaryophyllene oxide 14350
isocaryophyllene oxide 1742211
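Before choosing anything, it helps to list exactly which names are affected. The sketch below groups (name, CID) rows and reports the names that map to more than one CID; the function names are hypothetical, and the data are the four rows from the example above. It does not pick a CID, since that choice needs chemical judgement.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_cids(rows: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect the distinct CIDs returned for each name, in order of first appearance."""
    by_name: Dict[str, List[int]] = defaultdict(list)
    for name, cid in rows:
        if cid not in by_name[name]:
            by_name[name].append(cid)
    return dict(by_name)


def ambiguous_names(by_name: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Keep only the names that map to more than one CID."""
    return {name: cids for name, cids in by_name.items() if len(cids) > 1}


if __name__ == "__main__":
    rows = [
        ("trans-caryophyllene", 5281522),
        ("trans-caryophyllene", 5281515),
        ("isocaryophyllene oxide", 14350),
        ("isocaryophyllene oxide", 1742211),
    ]
    for name, cids in ambiguous_names(group_cids(rows)).items():
        print(name, cids)
```

The resulting list can be recorded in the comments column so that every ambiguous name is reviewed once.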
This is a real and serious problem. @mannyrules take note. Chemical names are sometimes used to represent different structures, either because of generic nature or mistakes. Here are the entries for caryophyllene in Pubchem:
5281515 beta-Caryophyllene THIS NAME IS AMBIGUOUS, IT DOESN'T GIVE THE CHIRALITY
PubChem CID: 5281515 https://pubchem.ncbi.nlm.nih.gov/compound/beta-caryophyllene
Molecular Formula: C15H24
Chemical Names:
BETA-CARYOPHYLLENE
Caryophyllene
(-)-trans-Caryophyllene
(-)-beta-caryophyllene
THIS IS A DEFINED STEREO ISOMER
87-44-5
Molecular Weight: 204.35 g/mol
Dates: Modify: 2019-08-10, Create: 2005-06-24
(-)-beta-caryophyllene is a beta-caryophyllene in which the stereocentre adjacent to the exocyclic double bond has S configuration while the remaining stereocentre has R configuration. It is the most commonly occurring form of beta-caryophyllene, occurring in many essential oils, particularly oil of cloves. It has a role as a non-steroidal anti-inflammatory drug, a fragrance and a metabolite. It is an enantiomer of a (+)-beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/%28%2B%29-beta-caryophyllene.
beta-Caryophyllene, also known as caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/caryophyllene or (−)-β-caryophyllene, is a natural bicyclic sesquiterpene that is a constituent of many essential oils including that of Syzygium aromaticum (cloves), Cannabis sativa, rosemary, and hops. It is usually found as a mixture with isocaryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/isocaryophyllene (the cis double bond isomer) and α-humulene (obsolete name: α-caryophyllene), a ring-opened isomer. beta-Caryophyllene is notable for having both a cyclobutane https://pubchem.ncbi.nlm.nih.gov/compound/cyclobutane ring and a trans-double bond in a nine-membered ring, both rarities in nature . beta-Caryophyllene is a sweet and dry tasting compound that can be found in a number of food items such as allspice, fig, pot marjoram, and roman camomile, which makes beta-caryophyllene a potential biomarker for the consumption of these food products. beta-Caryophyllene can be found in feces and saliva.
2.1.2 InChI *** THIS IS THE FUNDAMENTAL REPRESENTATION
InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m1/s1
===================
=========5281522==========
Isocaryophyllene
PubChem CID: 5281522 https://pubchem.ncbi.nlm.nih.gov/compound/Isocaryophyllene
Molecular Formula: C15H24
Chemical Names:
Isocaryophyllene
Caryophyllene
UNII-NRY8I0KNIR
beta-Caryophyllen
THIS NAME IS WRONG
gamma-caryophyllene
Molecular Weight: 204.35 g/mol
Dates: Modify: 2019-08-10, Create: 2005-06-24
Isocaryophyllene is a sesquiterpenoid.
Isocaryophyllene, also known as gamma-caryophyllene, belongs to the class of organic compounds known as sesquiterpenoids. Sesquiterpenoids are terpenes with three consecutive isoprene https://pubchem.ncbi.nlm.nih.gov/compound/isoprene units. Isocaryophyllene can be found primarily in saliva. Isocaryophyllene is found in allspice, and is widespread in plants (Jasminum, Origanum, and Pimpinella species).
InChI *** THE INCHI IS DIFFERENT
InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6-/t13-,14-/m1/s1
===============(+)-beta-caryophyllene====
There is an enantiomer of 5281515, which is:
===== 20831623====
(+)-beta-Caryophyllene
PubChem CID: 20831623 https://pubchem.ncbi.nlm.nih.gov/compound/%28%2B%29-beta-caryophyllene
Molecular Formula: C15H24
Chemical Names:
(+)-beta-caryophyllene
(+)-caryophyllene
(1S,4E,9R)-4,11,11-trimethyl-8-methylidenebicyclo[7.2.0]undec-4-ene
trans-(1S,9R)-4,11,11-trimethyl-8-methylenebicyclo[7.2.0]undec-4-ene
10579-93-8
Molecular Weight: 204.35 g/mol
Dates: Modify: 2019-08-10; Create: 2007-12-05
(+)-beta-caryophyllene is a beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/beta-caryophyllene in which the stereocentre adjacent to the exocyclic double bond has R configuration while the remaining stereocentre has S configuration. It is the enantiomer of (-)-beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/%28-%29-beta-caryophyllene, which occurs much more widely than the (+)-form. It has a role as a metabolite. It is an enantiomer of a (-)-beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/%28-%29-beta-caryophyllene.
InChI
InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m0/s1
====================
On Sat, Aug 17, 2019 at 8:51 AM Ambarish Kumar notifications@github.com wrote:
Sir, how should I select which CID to keep when multiple CIDs are generated for the same compound name?
For example:
trans-caryophyllene 5281522
trans-caryophyllene 5281515
isocaryophyllene oxide 14350
isocaryophyllene oxide 1742211
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
The problems are:
"beta-caryophyllene" is ambiguous. It could be either (-) or (+) and there is no way of knowing which - they both occur. But PubChem may simply guess. This is not good, but it's hard to build the correct data structure.
"iso-caryophyllene" has a synonym "beta-caryophyllene". This means that the same name has been used to refer to two different compounds. There is nothing we can do other than to ask experts which we should take.
THE FUNDAMENTAL REPRESENTATION IS THE INCHI. WE SHOULD ALWAYS USE THIS AS IT DESCRIBES EXACTLY WHAT WE HAVE.
It may be a good idea to do the primary lookup in ChEBI as I expect their quality control is higher.
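The difference between the three caryophyllene records above can be read directly from their InChI strings. The following is a minimal Python sketch, assuming a simplified parser that keys each layer by its one-letter prefix. The helper names `inchi_layers` and `differing_layers` are made up for illustration, and the sketch does not handle InChIs that repeat a layer prefix (for example in isotopic or fixed-H sections).

```python
def inchi_layers(inchi):
    """Split an InChI into layers keyed by prefix letter (simplified)."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI: " + inchi)
    parts = inchi[len(prefix):].split("/")
    # The first two parts (version and formula) carry no prefix letter.
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted layer keys whose values differ between two InChIs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# InChIs copied from the PubChem records quoted in this thread.
MINUS_BETA = ("InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/"
              "h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m1/s1")  # CID 5281515
ISO = ("InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/"
       "h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6-/t13-,14-/m1/s1")         # CID 5281522
PLUS_BETA = ("InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/"
             "h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m0/s1")   # CID 20831623

print(differing_layers(MINUS_BETA, ISO))        # ['b']
print(differing_layers(MINUS_BETA, PLUS_BETA))  # ['m']
```

Isocaryophyllene differs from (-)-beta-caryophyllene only in the double-bond layer (/b), and the (+)-form differs only in the /m layer that distinguishes the two mirror-image forms. A name-based lookup cannot see this, which is why the InChI should be recorded.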
Yes sir, InChIs are a more reliable identifier for retrieving all the information about chemical compounds.
Sir,
Please go through the sheet. It contains the original_name of each compound, its clean_name and the retrieved cid (to check whether the clean_name entry is searchable through the PubChem API or not).
The column description is as follows.
Compound_identifiers - unique ID assigned to each compound for EssoilDB2.0.
original_name - original name of the compound as present in the db.
clean_name - cleaned name.
cid - PubChem cid.
Name cleaning was done by making the following changes.
1. All stereo-isomeric notations are made into caps-letters.
2. " -> removed.
3. <extraSpace>- -> removed<extraSpace>
4. ## -> replaced by white-space.
5. # -> removed.
6. utf-8 encoding.
7. ,<extraSpace> -> ,
8. (-) -> (-)-
9. (-)<extraSpace>- -> (-)-
10. (-)<extraSpace> -> (-)-
11. * -> removed
12. -- -> -
13. removing special characters.
14. removing extra introduced characters.
In total, 2974 clean_name entries generate a cid. The names need to be cleaned further, after checking how the corresponding entries are named in the databases (PubChem or ChEBI).
This is useful progress. I note that you have resolved 2974 (out of 7171) compounds in Pubchem (i.e. found CIDs). I don't believe this is the total possible (see below)
I note also that there are synonyms:
C1369 1-butanol 1-butanol 263
C3513 butan-1-ol butan-1-ol 263
C3517 butanol butanol 263
In the final table this should be a single logical entry, with synonyms. It depends on how the table holds this. There could be a list with a separator (e.g. "|") or a separate compoundSynonymTable. At this stage I'd suggest the former.
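The separator approach could look something like the sketch below. It groups rows by CID purely because that is the identifier the sheet currently holds; the function name `aggregate_synonyms` and the output layout are assumptions made for illustration, and the same grouping could be keyed on InChI once those are recorded.

```python
from collections import OrderedDict


def aggregate_synonyms(rows, sep="|"):
    """Merge (compound_id, name, cid) rows that share a CID into one entry."""
    grouped = OrderedDict()
    for compound_id, name, cid in rows:
        entry = grouped.setdefault(cid, {"ids": [], "names": []})
        entry["ids"].append(compound_id)
        if name not in entry["names"]:  # keep each synonym once, in input order
            entry["names"].append(name)
    return {
        cid: {"ids": sep.join(e["ids"]), "synonyms": sep.join(e["names"])}
        for cid, e in grouped.items()
    }


# The three butanol rows quoted above.
rows = [
    ("C1369", "1-butanol", 263),
    ("C3513", "butan-1-ol", 263),
    ("C3517", "butanol", 263),
]
merged = aggregate_synonyms(rows)
print(merged[263]["synonyms"])  # 1-butanol|butan-1-ol|butanol
print(merged[263]["ids"])       # C1369|C3513|C3517
```

This assumes no compound name itself contains the separator character; if one does, a separate synonym table is the safer choice.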
However I think there are many compounds which can be resolved. Thus
2976b (-)-beta-ocimene is marked NA but a manual Pubchem search for "(-)-beta-ocimene" gives
(Z)-BETA-OCIMENE; cis-beta-Ocimene; cis-Ocimene; (Z)-3,7-Dimethylocta-1,3,6,-triene; beta-cis-Ocimene; ...
Compound CID: 5320250 https://pubchem.ncbi.nlm.nih.gov/compound/5320250
MF: C10H16
MW: 136.23g/mol
InChIKey: IHPKGUQCSIINRJ-NTMALXAHSA-N
IUPAC Name: (3Z)-3,7-dimethylocta-1,3,6-triene
Create Date: 2005-03-27
I suspect that there are a lot of other entries that could be resolved.
Capitalization:
C811 (2r,5e)-caryophyll-5-en-12-al (2r,5e)-caryophyll-5-en-12-al NA
C812 (2r,5s)-caryophyll-5-en-12-al (2r,5s)-caryophyll-5-en-12-al NA
C813 (2s,5e)-caryophyll-5-en-12-al (2s,5e)-caryophyll-5-en-12-al
should be (2R,5E)
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Yes sir, there are synonyms in the present table that need to be normalized as the next step.
More entries could be resolved by following the PubChem lookup conventions, for example by replacing
(2r,5e) -> (2R,5E)
(-) or (+) -> (Z) or (E)
-alpha- or -beta- or -gamma- -> -.alpha.- or -.beta.- or -.gamma.-
and so on.
Keeping synonyms in the same table with the separator "|" would be better.
BE VERY VERY VERY VERY CAREFUL.
You CANNOT equate R/S with +/-.
You CANNOT equate cis/trans with E/Z. NEVER.
The only thing that can be automatically normalized is (e) -> (E) (also Z, R, S).
On Sun, Aug 18, 2019 at 1:48 PM Ambarish Kumar notifications@github.com wrote:
Yes sir, there are synonyms in the present table that need to be normalized as the next step.
More entries could be resolved by following the PubChem lookup conventions, for example by replacing
(2r,5e) -> (2R,5E)
YES
(-) or (+) -> (Z) or (E)
NO NO NO NO NO NO NEVER NO
-alpha- or -beta- or -gamma- -> -.alpha.- or -.beta.- or -.gamma.-
NO. alpha is often part of the name. It is difficult to know what to do with the Greek characters. I would leave them as Unicode. The only thing that really matters is the InChI. Everything else can be looked up from that.
and so on.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
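The one normalization that is described above as safe, capitalizing stereodescriptor letters inside parentheses, might be sketched in Python as follows. The function name `capitalize_stereo` and the regular expression are assumptions for illustration. The pattern only covers simple descriptors such as (e) or (2r,5e), and deliberately leaves (-), (+), cis/trans and Greek letters alone.

```python
import re

# Parenthesised groups made only of optional locant digits followed by one of
# r, s, e, z, separated by commas: (e), (2r,5e), (1S,4E,9R), ...
_STEREO = re.compile(r"\((\d*[rsez](?:,\d*[rsez])*)\)", re.IGNORECASE)


def capitalize_stereo(name):
    """Upper-case R/S/E/Z descriptors in parentheses; change nothing else."""
    return _STEREO.sub(lambda m: "(" + m.group(1).upper() + ")", name)


print(capitalize_stereo("(2r,5e)-caryophyll-5-en-12-al"))  # (2R,5E)-caryophyll-5-en-12-al
print(capitalize_stereo("(-)-beta-ocimene"))               # (-)-beta-ocimene
```

Descriptors written with spaces or with primed or lettered locants are not matched by this pattern and would be left for manual checking.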
Did you find out why only 2974 compounds were looked up? Also I would suggest using ChEBI where possible. I think it's better quality than PubChem.
Ok sir. I will be careful and will check the isomeric notations available in the database.
Because of naming conventions and inconsistencies between the compound names in EssoilDB and those in the PubChem repository, I could retrieve only 2974 entries.
Sir, I tried using the ChEBI API, but it seems to take ChEBI IDs as the input for any query. I also found it more geared towards annotation of chemical compounds. Please send me the exact ChEBI API that suits this situation and can take a compound name as the initial input.
I don't understand why the PubChemAPI doesn't get a large number of the entries marked NA. First 6 Examples immediately after 2974
(-)-ar-curcumen-15-al NA
ar-Curcumen-15-al; XVWGGKCJOXAGDW-UHFFFAOYSA-N
Compound CID: 10846393 https://pubchem.ncbi.nlm.nih.gov/compound/10846393
MF: C15H20O
MW: 216.32g/mol
InChIKey: XVWGGKCJOXAGDW-UHFFFAOYSA-N
IUPAC Name: 4-(6-methylhept-5-en-2-yl)benzaldehyde
Create Date: 2006-10-26

C766 (-)-beta-ocimene (-)-beta-ocimene NA
(Z)-BETA-OCIMENE; cis-beta-Ocimene; cis-Ocimene; (Z)-3,7-Dimethylocta-1,3,6,-triene; beta-cis-Ocimene; ...
Compound CID: 5320250 https://pubchem.ncbi.nlm.nih.gov/compound/5320250
MF: C10H16
MW: 136.23g/mol
InChIKey: IHPKGUQCSIINRJ-NTMALXAHSA-N
IUPAC Name: (3Z)-3,7-dimethylocta-1,3,6-triene

C770 (-)-elema-1,3,11(13)-trien-12-al (-)-elema-1,3,11(13)-trien-12-al NA
SCHEMBL14215827; DJZHNAGRSWMVPA-QLFBSQMISA-N; (-)-Elema-1,3,11(13)-trien-12-al
Compound CID: 11651448 https://pubchem.ncbi.nlm.nih.gov/compound/11651448
MF: C15H22O
MW: 218.33g/mol
InChIKey: DJZHNAGRSWMVPA-QLFBSQMISA-N
IUPAC Name: 2-[(1R,3S,4S)-4-ethenyl-4-methyl-3-prop-1-en-2-ylcyclohexyl]prop-2-enal
Create Date: 2006-10-26

C837 (Ŕ) Ŕ gamma -curcumen-15-al (-)-gamma-curcumen-15-al NA
(-)-.gamma.-curcumen-15-al; IAYOZXCTYXYCHP-UHFFFAOYSA-N
Compound CID: 91747467 https://pubchem.ncbi.nlm.nih.gov/compound/91747467
MF: C15H22O
MW: 218.33g/mol
InChIKey: IAYOZXCTYXYCHP-UHFFFAOYSA-N
IUPAC Name: 4-(6-methylhept-5-en-2-yl)cyclohexa-1,3-diene-1-carbaldehyde
Create Date: 2015-04-28

C772 (-)-kaur-16-en-19-al (-)-kaur-16-en-19-al NA
ent-Kaurenal; ent-Kaur-16-en-19-al; CHEBI:15418; LMPR0104130005
Compound CID: 10062561 https://pubchem.ncbi.nlm.nih.gov/compound/10062561
MF: C20H30O
MW: 286.5g/mol
InChIKey: JCAVDWHQNFTFBW-GNVSMLMZSA-N
IUPAC Name: (1S,4S,5R,9S,10R,13S)-5,9-dimethyl-14-methylidenetetracyclo[11.2.1.01,10.04,9]hexadecane-5-carbaldehyde

C775 (-)-pacifigorgia-1(6),10-diene (-)-pacifigorgia-1(6),10-diene NA
Pacifigorgia-1(6),10-diene; VGMZAEHYZOQRSK-HUBLWGQQSA-N
Compound CID: 12051852 https://pubchem.ncbi.nlm.nih.gov/compound/12051852
MF: C15H24
MW: 204.35g/mol
InChIKey: VGMZAEHYZOQRSK-HUBLWGQQSA-N
IUPAC Name: (1S,4R,5S)-1,5-dimethyl-4-(2-methylprop-1-enyl)-2,3,4,5,6,7-hexahydro-1H-indene
So I get 6 out of 6 on the Manual Pubchem API . Please check that you can retrieve these as well. If there are problems with the API we need to find them.
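To compare the manual search with the programmatic lookup, it may help to log the exact request made for each name and how many CIDs came back. Below is a minimal Python sketch using the PubChem PUG REST name-to-CID route. The helper names (`name_to_cids_url`, `parse_cids`, `classify`, `fetch_cids`) are made up for illustration, and treating an HTTP 404 as "no match" is an assumption about how a failed name lookup is reported. The network call is defined but not executed here.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name):
    """Build the PUG REST URL that maps a compound name to CIDs."""
    encoded = urllib.parse.quote(name, safe="")  # encodes ( ) + and spaces
    return "{}/compound/name/{}/cids/JSON".format(PUG_BASE, encoded)


def parse_cids(payload):
    """Extract the CID list from a PUG REST JSON response body."""
    data = json.loads(payload)
    return data.get("IdentifierList", {}).get("CID", [])


def classify(cids):
    """Label a lookup so that failures and ambiguity are recorded, not hidden."""
    if not cids:
        return "unresolved"
    return "resolved" if len(cids) == 1 else "ambiguous"


def fetch_cids(name, timeout=30):
    """Query PubChem for one name (assumption: 404 means the name was not found)."""
    try:
        with urllib.request.urlopen(name_to_cids_url(name), timeout=timeout) as resp:
            return parse_cids(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return []
        raise


print(name_to_cids_url("(-)-beta-ocimene"))
```

Recording the `classify` label for every name in a batch gives the list of unresolved and ambiguous names directly.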
There are some corrupted names:
C6522 thuj-3-en-10-a1 thuj-3-en-10-a1 NA
This is a typo - should be thuj-3-en-10-al
C7036 propyl sovalerate propyl sovalerate NA
This is a typo - should be propyl isovalerate
DO NOT TRY TO CORRECT THESE - leave them to ME.
So:
REMOVE all mixtures (compound + compound). This probably needs to be done manually.
USE PubChem to resolve as many names as possible.
Then create a list of unresolved names for manual checking. I do not expect more than 500 unresolved names.
Then we will aggregate duplicates.
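Since removing mixtures probably needs a manual pass, a script can at least shortlist the candidates. The sketch below flags names that contain "+" or "/" once the optical-rotation tokens such as (+) have been set aside. The function name `looks_like_mixture` and the example names are made up for illustration, and a flagged name is only a candidate for review, not a decision.

```python
import re

# Optical-rotation tokens that legitimately contain "+" or "/".
_SIGN_TOKENS = re.compile(r"\((?:\+|-|\+/-|\+-|±)\)")


def looks_like_mixture(name):
    """Return True if a name contains '+' or '/' outside a (+), (-) or (+/-) token."""
    stripped = _SIGN_TOKENS.sub("", name)
    return "+" in stripped or "/" in stripped


# Hypothetical inputs, not taken from the sheet.
for candidate in ["(+)-limonene", "limonene + cineole", "(+/-)-linalool", "limonene/cineole"]:
    print(candidate, looks_like_mixture(candidate))
```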
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Sir, I have removed all comp+comp and comp/comp mixtures.
The reason I find for PubChem returning NA for the majority of compound names (used as query input) is that those names are not among the synonyms supplied by depositors to PubChem.
I am resolving all remaining names using the PubChem REST API.
Please go through the first 208 findings as the first batch job - sheet.
Table columns are as follows.
70 entries are available in PubChem and the rest are not searchable.
Examples of not-retrieved entries are as follows. PubChem_lookup is generated by truncating the names. Is this the right way to get PubChem_lookup and retrieve compound_cid?
ID not-retrieved entries PubChem_lookup Compound_CID
C898 (E)-3-hexanoic acid HEXANOIC ACID 8892
C5 (E)-2,2-decenal not found NA
C4 (E)-2,(Z)-6-decadienal 2,6-Decadienal 5283350
C893 (E)-2-undecenol Undecenol 22506525
C891 (E)-2-undecanal UNDECANAL 8186
C799 (2)-3-hexenylacetate Cis-3-Hexenyl Acetate 5363388
C800 (2)-3-hexenylbenzoate Cis-3-HEXENYLBENZOATE 32809
Search for C916 (E)-9-Epi-Caryophyllene generates (Z)-Caryophyllene; (Z)-.Beta.-Caryophyllene; 9-Epicaryophyllene; 9-Epi-Caryophyllene with compound CID - 6429301.
Search for (E)-bisabol-11-ol generates (Z)-Bisabol-11-Ol; AXLLSNSRONSXGV-MLPAPPSSSA-N with compound CID - 91750291.
For both of the searches above, I am concerned about the E and Z isomerism.
Search for C931 (E)-b-ocimene generates OCIMENE; (E)-Beta-Ocimene; Trans-Beta-Ocimene; 13877-91-3; Beta-Ocimene; Trans-Ocimene; (E)-3,7-Dimethylocta-1,3,6-Triene; (3E)-3,7-Dimethylocta-1,3,6-Triene; with compound CID - 5281553.
Sir, should I keep the lookup result?
All original names are the same as before (as a separate column).
Chemical nomenclature is complex and ambiguous. Any attempt to disambiguate MUST record ambiguity. Thus acetyl-furan could be 2-acetyl-furan or 3-acetyl-furan. OPSIN (https://opsin.ch.cam.ac.uk) gives:
and this must be recorded
Always test with OPSIN.
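One way to make sure ambiguity is recorded rather than silently resolved is to keep, for every name, both what was looked up and what each tool returned. The following Python sketch is a hypothetical record layout (the class name `LookupRecord`, its fields and the flag wording are all assumptions for illustration). It compares InChI strings by exact equality, which is only meaningful if both tools emit the same flavour of InChI.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LookupRecord:
    original_name: str                 # name exactly as stored in the database
    lookup_name: str                   # name actually sent to the lookup service
    pubchem: Dict[int, str] = field(default_factory=dict)  # CID -> InChI
    opsin_inchi: Optional[str] = None  # InChI from OPSIN, if it parsed the name

    def flags(self) -> List[str]:
        """List every reason this record needs a human to look at it."""
        notes = []
        if self.lookup_name != self.original_name:
            notes.append("lookup name differs from original")
        distinct = set(self.pubchem.values())
        if not distinct and self.opsin_inchi is None:
            notes.append("unresolved")
        if len(distinct) > 1:
            notes.append("ambiguous: multiple PubChem structures")
        if self.opsin_inchi is not None and distinct and self.opsin_inchi not in distinct:
            notes.append("OPSIN and PubChem disagree")
        return notes


# trans-caryophyllene returned two CIDs; the InChIs are those quoted in this thread.
record = LookupRecord(
    original_name="trans-caryophyllene",
    lookup_name="trans-caryophyllene",
    pubchem={
        5281522: "InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/"
                 "h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6-/t13-,14-/m1/s1",
        5281515: "InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/"
                 "h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m1/s1",
    },
)
print(record.flags())  # ['ambiguous: multiple PubChem structures']
```

A record whose lookup name was truncated or otherwise altered gets its own flag, so a CID found that way is never mistaken for a match on the original name.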