gilienv / EssOilDB

Restructuring of Essential Oil Database
Apache License 2.0

Disambiguating chemistry and fixing typos #76

Open petermr opened 5 years ago

petermr commented 5 years ago

Chemical nomenclature is complex and ambiguous. Any attempt to disambiguate MUST record ambiguity. Thus acetyl-furan could be 1-acetyl-furan or 2-acetyl-furan, OPSIN (https://opsin.ch.cam.ac.uk) gives:

APPEARS_AMBIGUOUS: Connection of acet to furan

and this must be recorded

Always test with OPSIN.

petermr commented 5 years ago

== create sample disambiguation of chemistry ==

For each lookup, go to the site and look up the name. Record the ID if found, else leave it empty. If there are special comments, record them.

This may be automatable through Egon's tools.

petermr commented 5 years ago

Chemical disambiguation

The tools to use are:

By using InChIs we have a correspondence between the systems.

INPUTS From the CSV file output column 2 (common names). Edit out quotes (") and delete spaces round " - "; split esters "bornylacetate" => bornyl acetate.
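The INPUTS clean-up above could be sketched roughly as follows. This is only an illustration: `preprocess_name` is a hypothetical helper, and the ester rule is an assumption that only handles a fused "...ylacetate" ending. The raw name is left untouched; the function returns a separate lookup string.

```python
import re


def preprocess_name(raw: str) -> str:
    """Return a lookup-ready copy of a raw common name (the raw value is not modified)."""
    name = raw.replace('"', "")               # edit out double quotes
    name = re.sub(r"\s*-\s*", "-", name)      # "alpha - pinene" -> "alpha-pinene"
    name = re.sub(r"\s+", " ", name).strip()  # collapse any leftover whitespace
    # Assumed ester rule (hypothetical): split a fused "...ylacetate" ending only.
    name = re.sub(r"(?<=yl)acetate$", " acetate", name)
    return name


if __name__ == "__main__":
    for raw in ['"alpha - pinene"', "bornylacetate", "bornyl acetate"]:
        print(repr(raw), "->", repr(preprocess_name(raw)))
```

Other esters (butyrate, formate, and so on) would need their own rules or manual curation.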

OUTPUTS If PubChem has an ambiguous compound it outputs stereoisomers. These may need editing manually to give the commonest.

A typical example, https://pubchem.ncbi.nlm.nih.gov/compound/42608158 , shows the most likely isomer for alloaromadendrene (Allo-Aromadendrene).

petermr commented 5 years ago

Scheduling chemical work

Vinita should supervise the processing, which will be largely carried out by Ambarish and later Shruthi.

It is particularly important to check correctness of results.

Method: Divide the work into small batches (Pubchem may mandate this, but it's good practice). At this stage no more than 100 compounds per batch

0/ There should be a single communal table (as described). There may need to be more columns than specified there.

1/ Run batch vs PubChem to get (a) CIDs (b) InChIs. Add comments (c) where PubChem has failed or is ambiguous.

2/ Run batch on OPSIN to get (d) InChIs and (e) comments.

3/ Search Wikidata with (a) CID (b) InChI (c) original name if that fails. This should be done automatically. For unambiguous compounds this will give a link to Wikidata that should be included in the EssOilDB database.
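The batching in the method above (no more than 100 compounds per batch) might look like this minimal sketch. The helper name `batches` is hypothetical, and the actual lookups would be run on each batch separately.

```python
from typing import Iterator, List, Sequence


def batches(names: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` names, preserving input order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield list(names[start:start + size])


if __name__ == "__main__":
    names = [f"compound_{i}" for i in range(250)]  # placeholder names, not real data
    for number, batch in enumerate(batches(names)):
        print(f"Batch-{number}: {len(batch)} names")
```

Keeping the input order inside each batch makes it easier to line up lookup results with the communal table afterwards.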

The correctness of the search will be shown by matching InChIs for numerous compounds. We will report early results in the poster.

ambarishK commented 5 years ago

Sir, Please go through the Batch-0 run for the first 100 compounds. compNameDisambiguation.csv(https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/compNameDisambiguation.csv) - output file for EssOilDB entry, PubChem lookups, OPSIN lookups and comments.

Wikidata entry is remaining right now.

ambarishK commented 5 years ago

Sir, Please go through the files. 100cnamePubchemAndOPSIN.csv 100cnamePubChem.csv 100cnameOPSIN.csv

PubChem lookup generates isomers. These are present in the files exactly as generated (the order of the PubChem lookup entries is also the same as in the generated output).

petermr commented 5 years ago

Files are meaningless unless they have documentation. Please briefly record (on Github) how these files were created.

Also I will probably move these files in the directory structure

On Mon, Jul 15, 2019 at 12:27 PM Ambarish Kumar notifications@github.com wrote:

Sir, Please go through the files. 100cnamePubchemAndOPSIN.csv https://github.com/gilienv/EssOilDB/blob/master/100cnamePubchemAndOPSIN.csv 100cnamePubchem.csv https://github.com/gilienv/EssOilDB/blob/master/100cnamePubchem.csv 100cnameOPSIN.csv [https://github.com/gilienv/EssOilDB/blob/master/100cnameOPSIN.csv]

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gilienv/EssOilDB/issues/76?email_source=notifications&email_token=AAFTCSZK47ITAATNHSC53J3P7RNK5A5CNFSM4ICLYMF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ5M7SI#issuecomment-511365065, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFTCS53XUTSVE6D6DREN2DP7RNK5ANCNFSM4ICLYMFQ .

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

EmanuelFaria commented 5 years ago

I had better luck fixing chemical names with this: https://www.ncbi.nlm.nih.gov/pcsubstance/?term=%22(Z)-BETA-OCIMENE%22

Not so much luck with this: https://opsin.ch.cam.ac.uk/

EmanuelFaria commented 5 years ago

This one is pretty good too: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:10447

gilienv commented 5 years ago

Thank you Manny!

Ambarish - Requesting you to look up Manny's suggestions above and check how we fare in terms of Chemical Disambiguation

gilienv commented 5 years ago

As PMR had first pointed out - we need to document the KINDS of errors we have in the Chemistry.

At present, the most comprehensive assessment of types of errors has been conducted by Manny, and we have had a few meetings to discuss various issues.

More on my dropbox, but happy to add here if Ambarish initiates a list of Error types, along with V.1 entries for each kind

petermr commented 5 years ago

We are clearly going to have to do manual correction of chemical names. Common problems include:

To be correct we should have at least 2 columns (raw data, curated data)


ambarishK commented 5 years ago

We generated 1000 records of compounds using OPSIN and PubChem. For getting WIKIDATA lookup column, we will have to reset the run. Preparing run for getting WIKIDATA and WIKIPEDIA lookups.

petermr commented 5 years ago

Ambarish has made good progress on disambiguation - see https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChem.tsv , which has lookups on trivial names (coname). (Trivial means "commonly used", not algorithmically parsable.) @mannyrules note that OPSIN is created for systematic names and has a limited number of trivial names. By contrast PubChem and ChEBI have a lot of trivial names but cannot parse systematic names that aren't in their databases. So OPSIN+PubChem/ChEBI should catch most.

@ambarishK and I had a good discussion today. The result in OPSINPubChem is:

ACTION we will need (at least) three columns

The benefit is that @mannyrules and other volunteers (@petermr ) can edit this on a day-by-day basis without affecting the rest of the submission.

Both Pubchem and OPSIN produce InChIs if successful. We should find out as soon as possible when InChIs don't agree as this will probably be an important new problem.
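A rough sketch of how InChI disagreements could be flagged per row. The status labels and the function name `compare_inchis` are hypothetical, and plain string equality is assumed to be good enough for standard InChIs produced by both services.

```python
from typing import Optional


def compare_inchis(pubchem_inchi: Optional[str], opsin_inchi: Optional[str]) -> str:
    """Classify one table row by whether the two InChI sources agree."""
    pubchem = (pubchem_inchi or "").strip()
    opsin = (opsin_inchi or "").strip()
    if not pubchem and not opsin:
        return "NO_INCHI"
    if not opsin:
        return "PUBCHEM_ONLY"
    if not pubchem:
        return "OPSIN_ONLY"
    return "AGREE" if pubchem == opsin else "DISAGREE"


if __name__ == "__main__":
    print(compare_inchis("InChI=1S/CH4/h1H4", "InChI=1S/CH4/h1H4"))
    print(compare_inchis("InChI=1S/CH4/h1H4", None))
```

Rows marked DISAGREE are the ones to inspect by hand first.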

petermr commented 5 years ago

Created a new table EssOilDBOPSINPubChemInChI.csv with some columns removed and sorted. This is just for more rapid comparison of InChIs. Ignore it.

EmanuelFaria commented 5 years ago

I have been and will continue to relatively quickly replace errors in punctuation as well as “foreign” characters (e.g. Ã, ã). I have also created a little table for myself where I am storing other, stranger anomalies such as things that look like spaces, but are actually some indescribable character.

Each time I find one, I save it so I can go through all of them “one last time” after the last person has touched the data.

I don’t know the cause of this strange data. It could be that we are each using different keyboard language settings, operating systems, or different dictionaries as default in our spreadsheet programs.

No matter though. I’m confident I can clean that stuff up.

My biggest limitation is not knowing what’s actually correct or incorrect. But on the other hand, my layman’s eyes see things others may miss, so together we’ll ferret out the weirdness.


petermr commented 5 years ago

A very quick eyeball of InChIs

Of the 1000 names, approximately 700 were translated by PubChem and 400 by OPSIN (though there is still a punctuation problem and this number should increase). There are 300 which have InChIs from both, and I have only spotted 3-4 which are grossly different, mainly because OPSIN doesn't have the right systematic names (e.g. terpinen-4-ol is a derivative of terpinene, but OPSIN doesn't have this trivial name and translates it as ter[pinene]-4-ol - 3 pinenes stitched together). But generally OPSIN agrees with PubChem ca 99%, which is great. @vinitamehlawat we can report this figure.

petermr commented 5 years ago

Brilliant. The main thing is to record everything and try to systematize the errors. For example: EXTRA_SPACE, MISSING_SPACE, INVISIBLE_CHAR

Then we can analyze what is most frequent.

I agree with you that there may be an invisible character problem. This might come from non-Unicode characters that cannot be rendered. Believe me, I know most of the "tricks"

We should only use ASCII characters (32-126). No clever spaces (non-breaking space, zero-width space, etc.). No greek characters (=> beta, etc.) No em-dashes (only hyphen-minus), no umlauts and other diacritics. Quoting is a real problem and in general No Quotes or apostrophes.
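A small sketch for finding (not correcting) characters outside the ASCII 32-126 range described above. The function name `non_ascii_report` is hypothetical; it only uses Python's standard `unicodedata` module to say what each offending character actually is.

```python
import unicodedata
from typing import List, Tuple


def non_ascii_report(name: str) -> List[Tuple[int, str, str]]:
    """Return (index, character, Unicode name) for every character outside ASCII 32-126."""
    report = []
    for index, char in enumerate(name):
        if not 32 <= ord(char) <= 126:
            # Control characters have no Unicode name, so fall back to a label.
            report.append((index, char, unicodedata.name(char, "UNNAMED")))
    return report


if __name__ == "__main__":
    for sample in ["alpha-pinene", "\u03b2-pinene", "bornyl\u00a0acetate"]:
        print(repr(sample), non_ascii_report(sample))
```

This makes invisible characters such as non-breaking or zero-width spaces show up by name, so each one can be recorded before anyone decides how to handle it.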

I don't think we can "correct" any of this algorithmically and if we do I suggest that I do it.

P.


ambarishK commented 5 years ago

Table for name correction

I will start adding records after the meeting today. Also, I will draft all possibilities of name inconsistencies, with examples.

ambarishK commented 5 years ago

Sir

I prepared a fresh sheet for name cleaning.

It contains exactly 7162 unique compound records.

The sheet we discussed today is still there, unchanged: https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/EssOilDBOPSINPubChemInChIs_A.csv

It has 7169 unique compound entries.

It is better to continue with the sheet discussed today.

I tried to find the cause of the difference of 7 records. It may be because 7 compound names are repeated.

Documentation for generating sheet is at

petermr commented 5 years ago

On Fri, Jul 19, 2019 at 10:04 AM Ambarish Kumar notifications@github.com wrote:

Table for name correction https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChemInChIs_A.csv

We cannot create additional tables until we have agreed the identifiers.

I will start adding records after meeting today. Also, I will draft all

possibilities of name inconsistencies with example.


ambarishK commented 5 years ago

Sir

Please check the sheet EssOilDBOPSINPubChemInChIsANewFinal.csv.

It contains exactly the same identifiers as the first (finalised) sheet, EssOilDBOPSINPubChemInChIs_A.csv

Removing sheets - EssOilDBOPSINPubChemInChIsANew.csv and EssOilDBOPSINPubChemInChIsANew.tsv

ambarishK commented 5 years ago

We are clearly going to have to do manual correction of chemical names. Common problems include:

  • misspelling e.g - 1,8-cineol
  • spaces included "alpha - pinene" e.g - 1,2,3,4-Tetrahydro-1,5,7-trimethyl naphthalene
  • spaces omitted "ethylacetate" e.g - (e)-sesquilavandulylacetate
  • hyphens omitted/included e.g - 1,8 cineole
  • quotes (strange, unbalanced...) e.g - (2,4)-nonadienal
  • multiple locants

EssOilDB entry is "bergamotol acetate" but PubChem search shows - Trans-.alpha.-Bergamatol Acetate OR (Z)-.Alpha.-Bergamotol Acetate OR Cis-alpha-Bergamotol Acetate.

  • missing locants

e.g - borneole

e.g - 1,4-cadinadienea

e.g - humulene epoxide iii. It should have been humulene epoxide III.

e.g EssOilDBEntry - hexadecanoic0acid. It should have been Hexadecanoic acid.

e.g - EssOilDBEntry is (2e)-octen-1-ol. It should have been E-2-octen-1-ol.

To be correct we should have at least 2 columns (raw data, curated data)

petermr commented 5 years ago

Thanks. Yes, this is a difficult area and we are going to have to treat it carefully and systematically. It is essential to preserve the original spelling regardless of whether it is "wrong" or "right". So we must have a column for raw name. There are names which are very similar but represent different compounds. If we "correct" these we will corrupt the database. Thus:

decanol decanal decenol decenal

are all valid names and are all distinct. (if the original abstracter made a copying error it may be difficult to detect)
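To illustrate why automatic "correction" is risky here, the sketch below computes the Levenshtein edit distance between those four names. This is a standard textbook algorithm used only as an illustration; it is not part of the project's tooling.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    names = ["decanol", "decanal", "decenol", "decenal"]
    for first, second in combinations(names, 2):
        print(first, second, edit_distance(first, second))
```

Every name is a single substitution away from two of the others, so a spell-checker that "fixes" names within distance 1 could silently turn one valid compound into another.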

On Mon, Jul 22, 2019 at 8:56 AM Ambarish Kumar notifications@github.com wrote:

We are clearly going to have to do manual correction of chemical names. Common problems include:

  • misspelling e.g - 1,8-cineol

This is not a misspelling, it's a synonym. See https://pubchem.ncbi.nlm.nih.gov/compound/Eucalyptol which lists

2.4 Synonyms

2.4.1 MeSH Entry Terms

1,8 Cineol

1,8 Cineole

1,8 Epoxy p menthane

1,8-cineol

1,8-cineole

1,8-Epoxy-p-menthane

cineole

eucalyptol

Soledum

  • spaces included "alpha - pinene" e.g - 1,2,3,4-Tetrahydro-1,5,7-trimethyl naphthalene

Yes!

  • spaces omitted "ethylacetate" e.g - (e)-sesquilavandulylacetate

Yes

  • hyphens omitted/included e.g - 1,8 cineole

Yes

  • quotes (strange, unbalanced...) e.g - (2,4)-nonadienal

Yes

  • multiple locants

  • missing locants

We should create short unique codes for this:

examples SYNONYM ADDED_SPACE MISSING_SPACE MISSING_HYPHEN ADDED_HYPHEN QUOTE_ERROR MULTIPLE_LOCANT MISSING_LOCANT

By using codes like this (always uppercase) we can normalize the reporting of errors.
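A minimal sketch of how such codes could be validated and counted. It assumes, hypothetically, that each row's comment cell holds zero or more codes separated by spaces; the code set is the example list above and would grow as new error kinds are agreed.

```python
from collections import Counter
from typing import Iterable

# The example codes from the comment above; extend as new kinds are agreed.
ERROR_CODES = {
    "SYNONYM", "ADDED_SPACE", "MISSING_SPACE", "MISSING_HYPHEN",
    "ADDED_HYPHEN", "QUOTE_ERROR", "MULTIPLE_LOCANT", "MISSING_LOCANT",
}


def tally_codes(comment_cells: Iterable[str]) -> Counter:
    """Count error codes across comment cells, normalizing to uppercase."""
    counts: Counter = Counter()
    for cell in comment_cells:
        for token in cell.split():
            code = token.upper()
            if code not in ERROR_CODES:
                raise ValueError(f"unknown error code: {token!r}")
            counts[code] += 1
    return counts


if __name__ == "__main__":
    cells = ["MISSING_SPACE", "synonym MISSING_SPACE", ""]
    print(tally_codes(cells).most_common())
```

Rejecting unknown codes keeps free-text variants from creeping into the column, which is what makes the frequency analysis meaningful.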

To be correct we should have at least 2 columns (raw data, curated data)


ambarishK commented 5 years ago

Please go through the name cleaning sheet updated by me - Copy of EssOilDBOPSINPubChemInChIs_A.csv.

There is additional column for IUPAC name. I have added first 50 records into it.

I added a short description of today's file to the documentation page.

ambarishK commented 5 years ago

Sir,

Compound_identifiers are now as C1,C2,C3 ......which corresponds to previous identifiers 1C, 2C, 3C ...... respectively. Updated sheet with compound_identifier.

petermr commented 5 years ago

I will revisit this after I have created the poster. We have to start again and document exactly what we start with and what operations we carry out. The confusing thing was the IUPAC names, which were not in the original V1.0 (as far as I know). In fact there is only one compound name and possibly a CAS number.

But I have to talk with Gita first.

On Tue, Jul 23, 2019 at 1:08 PM Ambarish Kumar notifications@github.com wrote:

Sir, Please go through the documentation page https://github.com/gilienv/EssOilDB/blob/master/chemistry/Disambiguating_chemistry_and_fixing_typos.md. I added a short description of today file.

Compound_identifiers are now as C1,C2,C3 ......which corresponds to 1C, 2C, 3C ...... respectively. Updated sheet with compound_identifier https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChemInChIs_A.csv .


ambarishK commented 5 years ago

Sir, I have listed WIKIDATA 'Q' ID for all compounds onto the poster. Please go through the page

petermr commented 5 years ago

Thank you. I will look

On Wed, Jul 24, 2019 at 1:11 PM Ambarish Kumar notifications@github.com wrote:

Sir, I have listed WIKIDATA 'Q' ID for all compounds onto the poster. Please go through the page https://github.com/gilienv/EssOilDB/blob/master/EssOilDBPosterWIKIDATA-QID.md


ambarishK commented 5 years ago

Dear Sir

One pic I found in my mobile camera roll, It is of harvesting time ( of this March when I had visited my home ). Pic has Lantana camara shrubs spread at the bottom. If convenient, it can be included into the poster.

petermr commented 5 years ago

Thanks Ambarish, Nice offer, but I had to send the poster off today.

On Thu, Jul 25, 2019 at 7:08 AM Ambarish Kumar notifications@github.com wrote:

Dear Sir

One pic https://github.com/gilienv/EssOilDB/blob/master/assets/IMG_9575.JPG I found in my mobile camera roll, It is of harvesting time ( of this March when I had visited my home ). Pic has Lantana camara shrubs spread at the bottom. If convenient, it can be included into the poster.


ambarishK commented 5 years ago

Sir, I am adding cleaned names to the sheet.

Column description is as follows.

  1. Compound_identifier - Unique ID assigned to each compounds.
  2. Original_name - Original name of compounds mentioned into the database.
  3. cleaned_name - Cleaned name of compounds.
  4. name_comments - comments for name cleaning.

Cleaned names are obtained from PubChem ( as compound name lookup).

I am adding cleaned_name from starting entries (from beginning).

Few names are not retrieved. Ex -

ID Original_name

C4. (e)-2,(z)-6-decadienal

C5. (e)-2,2-decenal

C7. (e)-2-decanal

C13. (e)-2-hexyl butyrate

C20. (e)-anethole+bornyl acetate

Please suggest any corrections or changes. Also, we will have to check whether all cleaned_name values generate an InChIKey or not.

petermr commented 5 years ago

Thank you, will have a look. This is a very important table. Do they all have E2.0 identifiers?

On Fri, Aug 16, 2019 at 8:58 AM Ambarish Kumar notifications@github.com wrote:

Sir, I am adding cleaned names to the sheet https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/CopyEssOilDBOPSINPubChemInChIs_A_Sheet1.csv .


ambarishK commented 5 years ago

Yes sir.

petermr commented 5 years ago

Please do NOT use capital letters in chemical names, except for atoms and stereo identifiers

(Z)-Alpha-Bisabolene should be (Z)-alpha-bisabolene

In general PubChem or ChEBI will give the correct capitalization.

Common rules:

E, Z, R, S capitalized: (E)-but-2-ene, (R,S)-tartaric acid

o-, m-, p- lowercase: o-cresol, p-menthane

N- capitalized: N-ethyl succinimide

But the safe way is to look this up.
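As a rough screening aid only, the heuristic below flags words of two or more letters that contain a capital. It is an assumption-laden sketch (the function name is hypothetical) and no substitute for looking the name up: single capital letters such as E, Z, R, S and N pass, but legitimate capitals like the Roman numeral in "humulene epoxide III" are flagged too.

```python
import re
from typing import List


def suspicious_capitals(name: str) -> List[str]:
    """Return alphabetic runs of 2+ letters that contain at least one uppercase letter."""
    tokens = re.findall(r"[A-Za-z]+", name)
    return [token for token in tokens if len(token) > 1 and token != token.lower()]


if __name__ == "__main__":
    for sample in ["(Z)-Alpha-Bisabolene", "(Z)-alpha-bisabolene", "N-ethyl succinimide"]:
        print(sample, suspicious_capitals(sample))
```

Flagged names would go to a reviewer, not be lowercased automatically.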

This table will be a very important resource for the future.


petermr commented 5 years ago

THERE IS A VERY VERY SERIOUS ERROR IN COLUMN 12 WHICH CORRUPTS THE TABLE COMPLETELY.

The column is called "2,4-nonadienal" which is a VALUE not a name.

THIS MEANS THAT EVERY VALUE IN THIS COLUMN POINTS TO THE WRONG COMPOUND (PROBABLY OFF-BY-ONE).

PLEASE FIND WHERE THE ERROR OCCURRED . DO NOT HAND EDIT THE TABLE. IF YOU GET THIS WRONG IT WILL DESTROY THE TABLE FOR EVER.

Correct the software that generates the table and re-generate it .

What is the actual name of this column?
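A check of this kind could catch a value sitting in the header row before the table is shared. The sketch assumes, hypothetically, that valid column names contain only letters, digits, underscores and spaces and do not start with a digit, so a header such as "2,4-nonadienal" is reported.

```python
import csv
import io
import re
from typing import List, Tuple

# Assumed naming rule for column headers (hypothetical, adjust to the agreed schema).
HEADER_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_ ]*$")


def bad_headers(csv_text: str) -> List[Tuple[int, str]]:
    """Return (1-based column number, header) for headers that do not look like names."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        (number, cell)
        for number, cell in enumerate(header, start=1)
        if not HEADER_PATTERN.match(cell)
    ]


if __name__ == "__main__":
    sample = 'Compound_identifier,Original_name,"2,4-nonadienal"\nC1,x,y\n'
    print(bad_headers(sample))
```

Running it inside the script that generates the table, rather than on a hand-edited copy, keeps the fix in the software as requested.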


ambarishK commented 5 years ago

Sir,

Removed column 12. It had been left in the sheet; I had put it there to get the CIDs and InChIKeys.

ambarishK commented 5 years ago

Sir, I am making all the changes related to the case of letters used in the nomenclature.

For name cleaning I have to do a PubChem lookup for each compound name. If I change the case of letters or the notation used in the nomenclature (for the entire Original_name column in the Excel sheet), how will I verify the cleaned_name?

Until now I have done the PubChem compound name lookup manually for each compound. Please suggest how to proceed further.

petermr commented 5 years ago

Pubchem and ChEBI and Wikidata all have APIs for automatic lookup. You should never use manual ones for more than 10. All these systems have RESTful APIs. You construct a URL and then use curl or similar system to query the system. The results come back as JSON or XML depending on what is available. Some of them allow multiple queries in a batch.
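A minimal sketch of the URL-construction step for PubChem's PUG REST name-to-CID lookup. The helper names are hypothetical, and the response shape assumed in `parse_cids` is the `IdentifierList`/`CID` JSON that PUG REST returns for `cids` requests; no network call is made here.

```python
import json
from typing import List
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_cids_url(name: str) -> str:
    """Build the PUG REST URL that looks up CIDs for a compound name."""
    # Percent-encode everything that is not URL-safe, including "/" and parentheses.
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def parse_cids(payload: str) -> List[int]:
    """Extract the CID list from a JSON response; return [] if none is present."""
    data = json.loads(payload)
    return data.get("IdentifierList", {}).get("CID", [])


if __name__ == "__main__":
    print(pubchem_cids_url("(Z)-beta-ocimene"))
    print(parse_cids('{"IdentifierList": {"CID": [5281515]}}'))
```

The printed URL can be passed to `curl` or fetched with `urllib.request.urlopen`, and the response body handed to `parse_cids`. Requests should be spaced out in line with PubChem's usage policy.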

ambarishK commented 5 years ago

Sir, how should I choose which CID to keep when multiple CIDs are returned for the same compound name?

For example:

trans-caryophyllene 5281522

trans-caryophyllene 5281515

isocaryophyllene oxide 14350

isocaryophyllene oxide 1742211
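One way to make such cases visible rather than silently picking a CID is to group the lookup results by name and flag every name with more than one CID. This is a sketch only, with hypothetical helper names; choosing between the CIDs still needs a chemist.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_cids(rows: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect the distinct CIDs returned for each name, in order of first appearance."""
    cids_by_name: Dict[str, List[int]] = defaultdict(list)
    for name, cid in rows:
        if cid not in cids_by_name[name]:
            cids_by_name[name].append(cid)
    return dict(cids_by_name)


def ambiguous_names(rows: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Return only the names that map to more than one CID."""
    return {name: cids for name, cids in group_cids(rows).items() if len(cids) > 1}


if __name__ == "__main__":
    rows = [
        ("trans-caryophyllene", 5281522),
        ("trans-caryophyllene", 5281515),
        ("isocaryophyllene oxide", 14350),
        ("isocaryophyllene oxide", 1742211),
    ]
    print(ambiguous_names(rows))
```

Each flagged name could then be recorded with all candidate CIDs in the comments column, so the ambiguity is preserved.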

petermr commented 5 years ago

This is a real and serious problem. @mannyrules take note. Chemical names are sometimes used to represent different structures, either because of generic nature or mistakes. Here are the entries for caryophyllene in Pubchem:

5281515 beta-Caryophyllene THIS NAME IS AMBIGUOUS , IT DOESN'T GIVE THE CHIRALITY

PubChem CID: 5281515

Chemical Safety: Irritant, Health Hazard

Molecular Formula: C15H24

Chemical Names:

BETA-CARYOPHYLLENE

Caryophyllene

(-)-trans-Caryophyllene

(-)-beta-caryophyllene

THIS IS A DEFINED STEREO ISOMER

87-44-5

Molecular Weight: 204.35 g/mol

(-)-beta-caryophyllene is a beta-caryophyllene in which the stereocentre adjacent to the exocyclic double bond has S configuration while the remaining stereocentre has R configuration. It is the most commonly occurring form of beta-caryophyllene, occurring in many essential oils, particularly oil of cloves. It has a role as a non-steroidal anti-inflammatory drug, a fragrance and a metabolite. It is an enantiomer of a (+)-beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/%28%2B%29-beta-caryophyllene.

beta-Caryophyllene, also known as caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/caryophyllene or (−)-β-caryophyllene, is a natural bicyclic sesquiterpene that is a constituent of many essential oils including that of Syzygium aromaticum (cloves), Cannabis sativa, rosemary, and hops. It is usually found as a mixture with isocaryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/isocaryophyllene (the cis double bond isomer) and α-humulene (obsolete name: α-caryophyllene), a ring-opened isomer. beta-Caryophyllene is notable for having both a cyclobutane https://pubchem.ncbi.nlm.nih.gov/compound/cyclobutane ring and a trans-double bond in a nine-membered ring, both rarities in nature . beta-Caryophyllene is a sweet and dry tasting compound that can be found in a number of food items such as allspice, fig, pot marjoram, and roman camomile, which makes beta-caryophyllene a potential biomarker for the consumption of these food products. beta-Caryophyllene can be found in feces and saliva.

2.1.2 InChI *** THIS IS THE FUNDAMENTAL REPRESENTATION

InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m1/s1

===================

=========5281522==========

Isocaryophyllene PubChem CID: 5281522 https://pubchem.ncbi.nlm.nih.gov/compound/Isocaryophyllene

Chemical Safety: Irritant, Health Hazard

Molecular Formula: C15H24

Chemical Names:

Isocaryophyllene

Caryophyllene

UNII-NRY8I0KNIR

beta-Caryophyllen

THIS NAME IS WRONG

gamma-caryophyllene

Molecular Weight: 204.35 g/mol

Dates: Modify 2019-08-10, Create 2005-06-24

Isocaryophyllene is a sesquiterpenoid.

Isocaryophyllene, also known as gamma-caryophyllene, belongs to the class of organic compounds known as sesquiterpenoids. Sesquiterpenoids are terpenes with three consecutive isoprene https://pubchem.ncbi.nlm.nih.gov/compound/isoprene units. Isocaryophyllene can be found primarily in saliva. Isocaryophyllene is found in allspice, and is widespread in plants (Jasminum, Origanum, and Pimpinella species).

2.1.2 InChI *** THE INCHI IS DIFFERENT

InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6-/t13-,14-/m1/s1

===============(+)-beta-caryophyllene====

There is an enantiomer of 5281515, which is:

=========20831623==========

(+)-beta-Caryophyllene PubChem CID: 20831623 https://pubchem.ncbi.nlm.nih.gov/compound/%28%2B%29-beta-caryophyllene

Molecular Formula: C15H24

Chemical Names:

(+)-beta-caryophyllene

(+)-caryophyllene

(1S,4E,9R)-4,11,11-trimethyl-8-methylidenebicyclo[7.2.0]undec-4-ene

trans-(1S,9R)-4,11,11-trimethyl-8-methylenebicyclo[7.2.0]undec-4-ene

10579-93-8

Molecular Weight: 204.35 g/mol

Dates: Modify 2019-08-10, Create 2007-12-05

(+)-beta-caryophyllene is a beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/beta-caryophyllene in which the stereocentre adjacent to the exocyclic double bond has R configuration while the remaining stereocentre has S configuration. It is the enantiomer of (-)-beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/%28-%29-beta-caryophyllene, which occurs much more widely than the (+)-form. It has a role as a metabolite. It is an enantiomer of a (-)-beta-caryophyllene https://pubchem.ncbi.nlm.nih.gov/compound/%28-%29-beta-caryophyllene.

2.1.2InChI

InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m0/s1

====================

On Sat, Aug 17, 2019 at 8:51 AM Ambarish Kumar notifications@github.com wrote:

Sir, how should I select which CID to keep when multiple CIDs are returned for the same compound name?

For example:

trans-caryophyllene 5281522

trans-caryophyllene 5281515

isocaryophyllene oxide 14350

isocaryophyllene oxide 1742211


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr commented 5 years ago

The problems are:

"beta-caryophyllene" is ambiguous. It could be either (-) or (+) and there is no way of knowing which, since both occur. PubChem may simply guess. This is not good, but it's hard to build the correct data structure.

"iso-caryophyllene" has a synonym "beta-caryophyllene". This means that the same name has been used to refer to two different compounds. There is nothing we can do other than ask experts which one we should take.

THE FUNDAMENTAL REPRESENTATION IS THE INCHI. WE SHOULD ALWAYS USE THIS AS IT DESCRIBES EXACTLY WHAT WE HAVE.

It may be a good idea to do the primary lookup in ChEBI as I expect their quality control is higher.
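To see why the InChI settles the question where the names do not, here is a minimal Python sketch that splits the three InChIs quoted above into layers and reports which layers differ. The helper names (`inchi_layers`, `differing_layers`) are made up for illustration, and the sketch assumes standard InChIs in which each layer letter occurs only once.

```python
# InChIs copied from the PubChem records quoted in this thread.
MINUS_BETA = ("InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11"
              "/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m1/s1")   # CID 5281515
ISO = ("InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11"
       "/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6-/t13-,14-/m1/s1")          # CID 5281522
PLUS_BETA = ("InChI=1S/C15H24/c1-11-6-5-7-12(2)13-10-15(3,4)14(13)9-8-11"
             "/h6,13-14H,2,5,7-10H2,1,3-4H3/b11-6+/t13-,14-/m0/s1")    # CID 20831623


def inchi_layers(inchi):
    """Split an InChI into {layer_letter: value}; hypothetical helper."""
    parts = inchi.split("/")
    layers = {"version": parts[0].split("=", 1)[1], "formula": parts[1]}
    for part in parts[2:]:
        # First character names the layer: c, h, b (double bond), t, m, s (stereo).
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return {layer: (value_a, value_b)} for every layer that differs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}


if __name__ == "__main__":
    print(differing_layers(MINUS_BETA, ISO))        # double-bond layer /b
    print(differing_layers(MINUS_BETA, PLUS_BETA))  # enantiomer layer /m
```

The two records that share the name "trans-caryophyllene" differ only in the /b layer (double-bond geometry), and the two enantiomers differ only in the /m layer, so a table keyed on InChI keeps all three apart.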


ambarishK commented 5 years ago

Yes sir, InChIs are more reliable as an identifier to get all information of chemical compounds.

ambarishK commented 5 years ago

Sir,

Please go through the sheet. It contains the original_name of each compound, its clean_name and the retrieved CID (to check whether the clean_name entry is searchable through the PubChem API or not).

The columns are as follows.

  1. Compound_identifiers - unique ID assigned to each compound for EssoilDB2.0.
  2. original_name - original name of the compound as present in the db.
  3. clean_name - cleaned name.
  4. cid - PubChem CID.

Name cleaning is done by making the following changes.

1. All stereoisomeric notations are converted to capital letters.
2. " -> removed.
3. <extraSpace>- -> extra space removed.
4. ## -> replaced by a space.
5. # -> removed.
6. Names are encoded as UTF-8.
7. ,<extraSpace> -> ,
8. (-) -> (-)-
9. (-)<extraSpace>- -> (-)-
10. (-)<extraSpace> -> (-)-
11. * -> removed.
12. -- -> -
13. Special characters are removed.
14. Extra introduced characters are removed.
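For reference, the unambiguous rules in the list above could be sketched in Python roughly as follows. This is only an illustration, not the script that produced the sheet: the function name `clean_name` is made up, rules 1, 6, 13 and 14 are left out because the list does not say exactly which characters they cover, and treating (+) the same way as (-) is an assumption.

```python
import re


def clean_name(name):
    """Apply rules 2-5 and 7-12 from the list above (illustrative sketch only)."""
    s = name.replace('"', "")            # rule 2: drop double quotes
    s = s.replace("##", " ")             # rule 4: '##' becomes a space
    s = s.replace("#", "")               # rule 5: drop a single '#'
    s = s.replace("*", "")               # rule 11: drop '*'
    s = re.sub(r"\s*-\s*", "-", s)       # rules 3, 9: no spaces around '-'
    s = re.sub(r",\s+", ",", s)          # rule 7: no space after ','
    # Rules 8 and 10: a sign descriptor is always followed by one hyphen.
    # Handling (+) as well as (-) is an assumption, not stated in the list.
    s = re.sub(r"\(([+-])\)\s*-?", r"(\1)-", s)
    s = re.sub(r"-{2,}", "-", s)         # rule 12: collapse repeated hyphens
    return re.sub(r"\s+", " ", s).strip()


if __name__ == "__main__":
    for raw in ['"(-) - beta -ocimene"', "1,8- cineole*", "2, 6-decadienal"]:
        print(repr(raw), "->", repr(clean_name(raw)))
```

Keeping the original_name column untouched next to clean_name, as the sheet already does, means any rule that turns out to be too aggressive can be rolled back.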

In total, 2974 clean_name entries return a CID. The names need further cleaning after checking how the entries are currently named in the databases (PubChem or ChEBI).

petermr commented 5 years ago

This is useful progress. I note that you have resolved 2974 (out of 7171) compounds in PubChem (i.e. found CIDs). I don't believe this is the total possible (see below).

I note also that there are synonyms:

C1369 1-butanol 1-butanol 263

C3513 butan-1-ol butan-1-ol 263

C3517 butanol butanol 263

In the final table this should be a single logical entry, with synonyms. It depends on how the table holds this. There could be a list with a separator (e.g. "|") or a separate compoundSynonymTable. At this stage I'd suggest the former.
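A small Python sketch of the separator approach, using the three butanol rows above. The function name `merge_synonyms` and the tuple layout are assumptions for illustration; the CID is used as the grouping key only because that is what the sheet currently holds, and the InChI would be the safer key once it is available.

```python
from collections import defaultdict


def merge_synonyms(rows, sep="|"):
    """Collapse rows that share a CID into one entry with joined synonyms.

    rows: iterable of (compound_id, clean_name, cid); cid is None when the
    lookup returned NA. Unresolved rows are skipped so that unrelated
    compounds are never merged by accident.
    """
    grouped = defaultdict(list)
    for _compound_id, name, cid in rows:
        if cid is None:
            continue
        if name not in grouped[cid]:
            grouped[cid].append(name)
    return {cid: sep.join(names) for cid, names in grouped.items()}


if __name__ == "__main__":
    rows = [
        ("C1369", "1-butanol", 263),
        ("C3513", "butan-1-ol", 263),
        ("C3517", "butanol", 263),
        ("C811", "(2r,5e)-caryophyll-5-en-12-al", None),
    ]
    print(merge_synonyms(rows))
```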

However I think there are many compounds which can be resolved. Thus

2976b (-)-beta-ocimene is marked NA but a manual Pubchem search for "(-)-beta-ocimene" gives

(Z)-BETA-OCIMENE; cis-beta-Ocimene; cis-Ocimene; (Z)-3,7-Dimethylocta-1,3,6,-triene; beta-cis-Ocimene; ... https://pubchem.ncbi.nlm.nih.gov/compound/5320250 Compound CID: 5320250 https://pubchem.ncbi.nlm.nih.gov/compound/5320250 MF: C10H16 https://pubchem.ncbi.nlm.nih.gov/search/#query=C10H16 MW: 136.23g/mol InChIKey: IHPKGUQCSIINRJ-NTMALXAHSA-N IUPAC Name: (3Z)-3,7-dimethylocta-1,3,6-triene Create Date: 2005-03-27

I suspect that there are a lot of other entries that could be resolved.

Capitalization:

C811 (2r,5e)-caryophyll-5-en-12-al (2r,5e)-caryophyll-5-en-12-al NA

C812 (2r,5s)-caryophyll-5-en-12-al (2r,5s)-caryophyll-5-en-12-al NA

C813 (2s,5e)-caryophyll-5-en-12-al (2s,5e)-caryophyll-5-en-12-al

should be (2R,5E)


ambarishK commented 5 years ago

Yes sir, there are synonyms in the present table that need to be normalized as the next step.

More entries can be resolved by following the PubChem lookup conventions, for example by replacing

 (2r,5e) -> (2R,5E)
 (-) or (+) -> (Z) or (E)
 -alpha- or -beta- or -gamma-   ->  -.alpha.- or -.beta.- or -.gamma.-

and so on.

Keeping the synonyms in the same table with the separator "|" would be better.

petermr commented 5 years ago

BE VERY VERY VERY VERY CAREFUL.

You CANNOT equate R/S with +/-

You CANNOT equate cis/trans with E/Z

NEVER

The only thing that can be automatically normalized is (e) -> (E) (also Z, R, S)
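As a sketch of the one normalization that is safe to automate, the following Python only upper-cases r/s/e/z letters inside parenthesised descriptors such as (2r,5e). It never touches (+)/(-), cis/trans or Greek letters. The function name and the regular expression are illustrative assumptions and will not cover every descriptor style found in the data.

```python
import re

# Matches parenthesised CIP/double-bond descriptors such as (e), (2r,5e), (3Z).
# Purely numeric brackets such as (13) do not match because a letter is required.
_STEREO = re.compile(r"\((\d*[RSEZrsez](?:,\d*[RSEZrsez])*)\)")


def normalize_stereo_case(name):
    """Upper-case R/S/E/Z descriptors; leave everything else untouched."""
    return _STEREO.sub(lambda m: "(" + m.group(1).upper() + ")", name)


if __name__ == "__main__":
    for raw in ["(2r,5e)-caryophyll-5-en-12-al", "(e)-2-undecenol",
                "(-)-beta-ocimene", "(-)-elema-1,3,11(13)-trien-12-al"]:
        print(raw, "->", normalize_stereo_case(raw))
```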

On Sun, Aug 18, 2019 at 1:48 PM Ambarish Kumar notifications@github.com wrote:

Yes sir, There are synonyms present into present table that need to be normalized as next step.

Resolving more entries are possible after following PubChem lookup as in replacing

(2r,5e) -> (2R,5E)

YES

(-) or (+) -> (Z) or (E)

NO NO NO NO NO NO NEVER NO

-alpha- or -beta- or -gamma- -> -.alpha.- or -.beta.- or -.gamma.-

NO. alpha is often part of the name. It is difficult to know what to do with the Greek characters. I would leave them as Unicode. The only thing that really matters is the InChI. Everything else can be looked up from that.

and so on.


petermr commented 5 years ago

Did you find out why only 2974 compounds were looked up? Also I would suggest using ChEBI where possible. I think it's better quality than PubChem.


ambarishK commented 5 years ago

Ok sir. I will be careful and will check the isomeric notations available in the database.

Because the naming conventions of the compound names in EssoilDB are inconsistent with those of the PubChem repository, I could retrieve only 2974 entries.

Sir, I tried using the ChEBI API, but it seems to accept only ChEBI IDs as query input. Also, I found it to be meant more for annotation of chemical compounds. Please send me the exact ChEBI API that is suitable for this situation and can take a compound name as the initial input.

petermr commented 5 years ago

I don't understand why the PubChem API doesn't get a large number of the entries marked NA. Here are the first 6 examples immediately after 2974:

(-)-ar-curcumen-15-al NA ar-Curcumen-15-al; XVWGGKCJOXAGDW-UHFFFAOYSA-N https://pubchem.ncbi.nlm.nih.gov/compound/10846393 Compound CID: 10846393 https://pubchem.ncbi.nlm.nih.gov/compound/10846393

MF: C15H20O https://pubchem.ncbi.nlm.nih.gov/search/#query=C15H20O MW: 216.32g/mol InChIKey: XVWGGKCJOXAGDW-UHFFFAOYSA-N IUPAC Name: 4-(6-methylhept-5-en-2-yl)benzaldehyde Create Date: 2006-10-26

C766 (-)-beta-ocimene (-)-beta-ocimene NA (Z)-BETA-OCIMENE; cis-beta-Ocimene; cis-Ocimene; (Z)-3,7-Dimethylocta-1,3,6,-triene; beta-cis-Ocimene; ... https://pubchem.ncbi.nlm.nih.gov/compound/5320250 Compound CID: 5320250 https://pubchem.ncbi.nlm.nih.gov/compound/5320250 MF: C10H16 https://pubchem.ncbi.nlm.nih.gov/search/#query=C10H16 MW: 136.23g/mol InChIKey: IHPKGUQCSIINRJ-NTMALXAHSA-N IUPAC Name: (3Z)-3,7-dimethylocta-1,3,6-triene

C770 (-)-elema-1,3,11(13)-trien-12-al (-)-elema-1,3,11(13)-trien-12-al NA SCHEMBL14215827; DJZHNAGRSWMVPA-QLFBSQMISA-N; (-)-Elema-1,3,11(13)-trien-12-al https://pubchem.ncbi.nlm.nih.gov/compound/11651448 Compound CID: 11651448 https://pubchem.ncbi.nlm.nih.gov/compound/11651448

MF: C15H22O https://pubchem.ncbi.nlm.nih.gov/search/#query=C15H22O MW: 218.33g/mol InChIKey: DJZHNAGRSWMVPA-QLFBSQMISA-N IUPAC Name: 2-[(1R,3S,4S)-4-ethenyl-4-methyl-3-prop-1-en-2-ylcyclohexyl]prop-2-enal Create Date: 2006-10-26

C837 (Ŕ) Ŕ gamma -curcumen-15-al (-)-gamma-curcumen-15-al NA (-)-.gamma.-curcumen-15-al; IAYOZXCTYXYCHP-UHFFFAOYSA-N https://pubchem.ncbi.nlm.nih.gov/compound/91747467 Compound CID: 91747467 https://pubchem.ncbi.nlm.nih.gov/compound/91747467

MF: C15H22O https://pubchem.ncbi.nlm.nih.gov/search/#query=C15H22O MW: 218.33g/mol InChIKey: IAYOZXCTYXYCHP-UHFFFAOYSA-N IUPAC Name: 4-(6-methylhept-5-en-2-yl)cyclohexa-1,3-diene-1-carbaldehyde Create Date: 2015-04-28

C772 (-)-kaur-16-en-19-al (-)-kaur-16-en-19-al NA ent-Kaurenal; ent-Kaur-16-en-19-al; CHEBI:15418; LMPR0104130005 https://pubchem.ncbi.nlm.nih.gov/compound/10062561 Compound CID: 10062561 https://pubchem.ncbi.nlm.nih.gov/compound/10062561

MF: C20H30O https://pubchem.ncbi.nlm.nih.gov/search/#query=C20H30O MW: 286.5g/mol InChIKey: JCAVDWHQNFTFBW-GNVSMLMZSA-N IUPAC Name: (1S,4S,5R,9S,10R,13S)-5,9-dimethyl-14-methylidenetetracyclo[11.2.1.01,10.04,9]hexadecane-5-carbaldehyde

C775 (-)-pacifigorgia-1(6),10-diene (-)-pacifigorgia-1(6),10-diene NA Pacifigorgia-1(6),10-diene; VGMZAEHYZOQRSK-HUBLWGQQSA-N https://pubchem.ncbi.nlm.nih.gov/compound/12051852 Compound CID: 12051852 https://pubchem.ncbi.nlm.nih.gov/compound/12051852

MF: C15H24 https://pubchem.ncbi.nlm.nih.gov/search/#query=C15H24 MW: 204.35g/mol InChIKey: VGMZAEHYZOQRSK-HUBLWGQQSA-N IUPAC Name: (1S,4R,5S)-1,5-dimethyl-4-(2-methylprop-1-enyl)-2,3,4,5,6,7-hexahydro-1H-indene

So I get 6 out of 6 on the Manual Pubchem API . Please check that you can retrieve these as well. If there are problems with the API we need to find them.
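One way to re-check these six names programmatically is the PubChem PUG REST name-to-CID endpoint. The Python sketch below is an assumption about how the lookup could be scripted, not a description of the script actually used: the function names are made up, and the network call is kept separate from the URL building and response parsing so those parts can be tested offline. Names containing brackets and commas must be percent-encoded, which is one possible source of spurious NA results.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"


def cid_url(name):
    """Build the PUG REST URL for a name lookup, percent-encoding the name."""
    return f"{PUG_BASE}/{urllib.parse.quote(name, safe='')}/cids/JSON"


def parse_cids(payload):
    """Extract the CID list from a decoded PUG REST JSON response."""
    return payload.get("IdentifierList", {}).get("CID", [])


def lookup_cids(name, timeout=30.0):
    """Return all CIDs PubChem reports for a name; an empty list means NA."""
    try:
        with urllib.request.urlopen(cid_url(name), timeout=timeout) as resp:
            return parse_cids(json.load(resp))
    except urllib.error.HTTPError as err:
        if err.code == 404:      # PubChem answers 404 when the name is unknown
            return []
        raise


if __name__ == "__main__":
    # Requires network access; one of the six names checked manually above.
    print(lookup_cids("(-)-beta-ocimene"))
```

Recording the full list of returned CIDs, rather than only the first, keeps the ambiguity visible in the table, as asked for at the top of this issue.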

There are some corrupted names:

C6522 thuj-3-en-10-a1 thuj-3-en-10-a1 NA

This is a typo - should be thuj-3-en-10-al

C7036 propyl sovalerate propyl sovalerate NA

This is a typo - should be propyl isovalerate

DO NOT TRY TO CORRECT THESE - leave them to ME.

So:

REMOVE all mixtures (compound + compound). This probably needs to be done manually.

USE PubChem to resolve as many names as possible.

Then create a list of unresolved names for manual checking. I do not expect more than 500 unresolved names.
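Since the mixtures probably have to be removed by hand, a script can at least produce the candidate list. The Python below is a rough sketch under stated assumptions: the function names are hypothetical, and the pattern simply flags a "+" or "/" sitting between two name characters, so the sign in (+)-limonene is not flagged. Every flagged name still needs a manual look.

```python
import re

# '+' or '/' with a letter or digit on both sides, e.g. "a + b" or "a/b".
# "(+)-limonene" does not match because '+' is preceded by '('.
_MIXTURE = re.compile(r"\w\s*[+/]\s*\w")


def looks_like_mixture(name):
    """Heuristic flag only; a human decides whether it really is a mixture."""
    return bool(_MIXTURE.search(name))


def split_for_review(names):
    """Return (single_compound_candidates, mixture_candidates)."""
    singles = [n for n in names if not looks_like_mixture(n)]
    mixtures = [n for n in names if looks_like_mixture(n)]
    return singles, mixtures


if __name__ == "__main__":
    demo = ["limonene + 1,8-cineole", "(+)-limonene",
            "alpha-pinene/beta-pinene", "(E)-2-undecenol"]
    print(split_for_review(demo))
```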

Then we will aggregate duplicates.


ambarishK commented 5 years ago

Sir, I have removed all comp+comp and comp/comp mixtures.

The reason I find for most of the NA entries generated by PubChem is that the compound names used as query input are not among the synonyms supplied by depositors to PubChem.

I am resolving all remaining names using the PubChem REST API.

Please go through the first 208 findings as the first batch job - sheet.

Table columns are as follows.

70 entries are available in PubChem and the rest are not searchable.

Examples of not-retrieved entries are as follows. PubChem_lookup is generated after truncating the names. Is this the right way to get PubChem_lookup and retrieve compound_cid?

ID          not-retrieved entries              PubChem_lookup          Compound_CID

C898      (E)-3-hexanoic acid               HEXANOIC ACID              8892 

C5          (E)-2,2-decenal                        not found                        NA

C4          (E)-2,(Z)-6-decadienal          2,6-Decadienal                5283350

C893       (E)-2-undecenol                   Undecenol                   22506525   

C891       (E)-2-undecanal                   UNDECANAL                     8186 

C799       (2)-3-hexenylacetate       Cis-3-Hexenyl Acetate       5363388 

C800       (2)-3-hexenylbenzoate    Cis-3-HEXENYLBENZOATE    32809     

Search for C916 (E)-9-Epi-Caryophyllene generates (Z)-Caryophyllene; (Z)-.Beta.-Caryophyllene; 9-Epicaryophyllene; 9-Epi-Caryophyllene with compound CID - 6429301.

Search for (E)-bisabol-11-ol generates (Z)-Bisabol-11-Ol; AXLLSNSRONSXGV-MLPAPPSSSA-N with compound CID - 91750291.

For both of the searches above, I am concerned about the E and Z isomerism.

Search for C931 (E)-b-ocimene generates OCIMENE; (E)-Beta-Ocimene; Trans-Beta-Ocimene; 13877-91-3; Beta-Ocimene; Trans-Ocimene; (E)-3,7-Dimethylocta-1,3,6-Triene; (3E)-3,7-Dimethylocta-1,3,6-Triene; with compound CID - 5281553.

Sir, should I keep the lookup result?

All original names are the same as before (as a separate column).
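The E/Z concern in the searches above can be turned into an automatic flag. The Python sketch below compares the leading (E)/(Z) descriptor of the query with the descriptors on the synonyms PubChem returned. It is a heuristic with made-up function names: it only looks at a descriptor at the very start of a name, it cannot confirm identity (only the InChI can), and anything other than "consistent" should go to the manual-check list.

```python
import re

_LEADING_EZ = re.compile(r"^\(([EZ])\)-", re.IGNORECASE)


def leading_ez(name):
    """Return 'E' or 'Z' if the name starts with that descriptor, else None."""
    match = _LEADING_EZ.match(name.strip())
    return match.group(1).upper() if match else None


def ez_check(query, synonyms):
    """Classify a lookup hit as 'consistent', 'conflict' or 'unknown'."""
    wanted = leading_ez(query)
    if wanted is None:
        return "unknown"
    found = {leading_ez(s) for s in synonyms} - {None}
    if wanted in found:
        return "consistent"
    if found:                 # only the opposite descriptor was returned
        return "conflict"
    return "unknown"          # the hit carries no E/Z information at all


if __name__ == "__main__":
    print(ez_check("(E)-bisabol-11-ol",
                   ["(Z)-Bisabol-11-Ol", "AXLLSNSRONSXGV-MLPAPPSSSA-N"]))
    print(ez_check("(E)-b-ocimene",
                   ["OCIMENE", "(E)-Beta-Ocimene", "Trans-Beta-Ocimene"]))
```

Truncated lookups such as Undecenol for (E)-2-undecenol come out as "unknown", which matches the point that a truncated name no longer says which isomer was meant.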