petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

synonyms for compounds #12

Closed petermr closed 4 years ago

petermr commented 4 years ago

We have ca. 2100 compounds in EssoilDB 1.0. The critical information is:

The problem is that a compound occurs under many synonyms, e.g.:

C3125   anisole

We need a simple synonym table, one row per synonym

C3125   anisole
C3125   methoxybenzene
C3125   phenyl methyl ether

All names are therefore resolved through a unique EID.

Synonyms can be retrieved from PubChem or Wikidata.

ambarishK commented 4 years ago

OK sir.

So, first we should retrieve all synonyms for the ca. 2100 compounds currently in EssoilDB from PubChem or Wikidata.

Then assign the unique EID to each synonym (one record per row in the synonym table).

ambarishK commented 4 years ago

Sir,

How do I separate synonym values into different rows?

e.g.:


C214 | acetate|Acetate Ion|Acetic acid, ion(1-)|Acetate ions|71-50-1|monoacetate|MeCO2 anion|UNII-569DQM74SC|ethanoate|Acetat|569DQM74SC|Ethanoat|Shotgun|CHEMBL1354|racemic acetate|Azetat|acetyl hydroxide|Acetic cid glacial|TCLP extraction fluid 2|AC1Q1J2O|AC1Q1J9D|AC1L18N9|CH3-COO(-)|DTXSID1037694|CTK5D4394|CHEBI:30089|QTBSBXVTEAMEQO-UHFFFAOYSA-M|STL282721|CMC_13391|BDBM50159793|AKOS022101130|AN-25008|AN-23801|LS-189936|1395-EP2380874A2|1395-EP2380661A2|1395-EP2316837A1|1395-EP2316833A1|1395-EP2316827A1|1395-EP23

C215 | acetic acid|ethanoic acid|64-19-7|Ethylic acid|Acetic acid, glacial|Acetic acid glacial|Methanecarboxylic acid|Glacial acetic acid|Vinegar acid|Acetasol|Acide acetique|Essigsaeure|Aci-jel|Azijnzuur|Vinegar|Kyselina octova|Acido acetico|Octowy kwas|Pyroligneous acid|HOAc|Azijnzuur [Dutch]|Ethanoic acid monomer|acetyl alcohol|Aceticum acidum|Essigsaeure [German]|ethoic acid|Caswell No. 003|Otic Tridesilon|Octowy kwas [Polish]|Otic Domeboro|Acetic acid (natural)|Kyselina octova [Czech]|Acide acetique [French]|

C2776 | acetaldehyde|ethanal|acetic aldehyde|ethyl aldehyde|75-07-0|Acetaldehyd|Acetylaldehyde|aldehyde|Octowy aldehyd|Acetic ethanol|Aldeide acetica|Aldehyde acetique|acetaldehydes|Azetaldehyd|RCRA waste number U001|Acetaldehyde (natural)|NSC 7594|NCI-C56326|Acetaldehyd [German]|ACETYL GROUP|ethaldehyde|CCRIS 1396|HSDB 230|Octowy aldehyd [Polish]|UNII-GO1N1ZPR3B|Aldeide acetica [Italian]|Aldehyde acetique [French]|ethanone|UN1089|MFCD00006991|FEMA No. 2003|CHEBI:15343|AI3-31167|CH3CHO|EINECS 200-836-8|GO1N1ZPR3B|R

ambarishK commented 4 years ago

Sir, it got resolved. The output above is from the PubChem web API. The PubChem Identifier Exchange Service returns results as two columns, one for the input and one for the corresponding output.

175 acetate
175 Acetate Ion
175 Acetic acid, ion(1-)
175 Acetate ions
175 71-50-1
175 monoacetate
175 MeCO2 anion
175 UNII-569DQM74SC
175 ethanoate
175 Acetat

The next step is to map them to the unique EID.
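
The mapping step could look roughly like the sketch below. It assumes a CID-to-EID lookup table is available; the `cid_to_eid` dictionary and the `map_to_eid` helper are hypothetical names, and the pairing 175 -> C214 is only inferred from "acetate" appearing under both identifiers in this thread.

```python
# Two-column rows in the form shown above: (PubChem CID, synonym).
cid_rows = [
    (175, "acetate"),
    (175, "Acetate Ion"),
    (175, "Acetic acid, ion(1-)"),
    (175, "ethanoate"),
]

# Hypothetical CID -> EID lookup; 175 -> C214 is an assumption inferred
# from "acetate" being listed under both identifiers in this thread.
cid_to_eid = {175: "C214"}


def map_to_eid(rows, lookup):
    """Return (EID, synonym) rows plus a sorted list of CIDs with no EID."""
    mapped, unmapped = [], set()
    for cid, synonym in rows:
        eid = lookup.get(cid)
        if eid is None:
            unmapped.add(cid)
        else:
            mapped.append((eid, synonym))
    return mapped, sorted(unmapped)


eid_rows, missing = map_to_eid(cid_rows, cid_to_eid)
for eid, synonym in eid_rows:
    print(f"{eid}\t{synonym}")
```

Returning the unmapped CIDs separately makes it easy to spot compounds that still need an EID, rather than dropping them silently.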

petermr commented 4 years ago

No, one row per synonym

On Mon, Sep 2, 2019 at 1:16 PM Ambarish Kumar notifications@github.com wrote:

Sir,

How to separate synonym values into different rows.

e.g -

C214 | acetate | preferred
C214 | Acetic acid
C214 | Acetate Ion

C215 | acetic acid | preferred
C215 | ethanoic acid

C2776 | acetaldehyde | preferred
C2776 | ethanal

The preferred name should be the common name in Wikidata or Wikipedia. This may require my judgment.

P.
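
One way to get from the pipe-delimited export quoted earlier to this one-row-per-synonym layout is sketched below. `explode_synonyms` is a hypothetical helper, and marking the first listed name as "preferred" is only a placeholder assumption; as noted above, the real choice may need human judgment.

```python
def explode_synonyms(line):
    """Turn 'EID | name1|name2|...' into (EID, synonym, is_preferred) rows."""
    # The export may carry markdown-escaped pipes ("\|"); normalise them first.
    line = line.replace("\\|", "|")
    eid, _, names = line.partition("|")
    rows = []
    for name in names.split("|"):
        # Collapse runs of whitespace left over from the export.
        name = " ".join(name.split())
        if not name:
            continue
        # Assumption: the first name listed is provisionally "preferred".
        rows.append((eid.strip(), name, len(rows) == 0))
    return rows


raw = "C215 | acetic acid|ethanoic acid|64-19-7|Ethylic acid|Acetic acid,   glacial|"
for eid, name, preferred in explode_synonyms(raw):
    print(" | ".join([eid, name] + (["preferred"] if preferred else [])))
```

Truncated trailing entries in the export (such as a cut-off final name) would still come through as rows, so they need checking by hand.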

ambarishK commented 4 years ago

Sir, the output of the PubChem Identifier Exchange Service is one synonym per row (two columns, for the input and the corresponding output).

ambarishK commented 4 years ago

Sir, please go through the compound synonym table.

Compound synonyms are retrieved from the PubChem Identifier Exchange Service.

Steps to get compound synonyms are as follows.

Steps to map compound synonyms with EssoilDB unique identifier.

petermr commented 4 years ago

Thanks very much. Clearly PubChem has too many synonyms (trade names, etc.), about 50 per compound. We can remove some of them with regexes. We'll keep this, but maybe Wikidata is better. Can you try that? It's a difficult problem...
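
A minimal sketch of the regex filtering idea follows. The patterns in `NOISE_PATTERNS` are assumptions based on the kinds of entries visible in the PubChem output in this thread (CAS numbers, InChIKeys, database identifiers, vendor codes, language tags); they are not a tested or complete list.

```python
import re

# Assumed noise patterns, derived from the sample output in this thread.
NOISE_PATTERNS = [
    re.compile(r"^\d{2,7}-\d{2}-\d$"),            # CAS registry number, e.g. 64-19-7
    re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$"),   # InChIKey
    re.compile(                                   # database identifiers
        r"^(UNII|CHEMBL|CHEBI|DTXSID|AKOS|BDBM|MFCD|EINECS|NSC|CCRIS|HSDB)"
        r"[-: ]?[A-Z0-9]"
    ),
    re.compile(r"^[A-Z0-9]{1,6}[-_]\d{4,}\w*$"),  # vendor codes, e.g. LS-189936
    re.compile(r"\[[A-Z][a-z]+\]$"),              # language tag, e.g. [Dutch]
]


def is_noise(synonym):
    """True if the synonym looks like an identifier rather than a name."""
    return any(p.search(synonym) for p in NOISE_PATTERNS)


def filter_synonyms(synonyms):
    """Keep only synonyms that match none of the noise patterns."""
    return [s for s in synonyms if not is_noise(s)]


sample = [
    "acetic acid", "ethanoic acid", "64-19-7", "UNII-569DQM74SC",
    "CHEBI:30089", "QTBSBXVTEAMEQO-UHFFFAOYSA-M", "Azijnzuur [Dutch]",
    "LS-189936", "Vinegar acid",
]
print(filter_synonyms(sample))
```

Trade names such as "Acetasol" have no recognisable shape, so patterns like these cannot remove them; that part would still need a curated list or manual review.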

ambarishK commented 4 years ago

Yes sir.

ambarishK commented 4 years ago

Sir, what is the Wikidata property for chemical compound synonyms (to be used in a SPARQL query)?

petermr commented 4 years ago

Good question. I don't know. I'll think and ask about it. I don't think it's top priority yet.

On Thu, Sep 5, 2019 at 3:34 PM Ambarish Kumar notifications@github.com wrote:

Sir, what is the Wikidata property for chemical compound synonyms (to be used in a SPARQL query)?


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
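
On the SPARQL question: as far as I know, Wikidata does not store synonyms as a property. They are item aliases ("also known as"), which the Wikidata Query Service exposes as `skos:altLabel`. The sketch below only builds a query string; `build_alias_query` is a hypothetical helper, and looking the item up by PubChem CID (property P662) is one possible choice of key, not something decided in this thread.

```python
def build_alias_query(pubchem_cid, language="en"):
    """Build a SPARQL query listing the label and aliases of one compound."""
    return f"""
SELECT ?item ?label ?alias WHERE {{
  ?item wdt:P662 "{pubchem_cid}" .        # P662 = PubChem CID
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "{language}")
  OPTIONAL {{
    ?item skos:altLabel ?alias .          # aliases ("also known as")
    FILTER(LANG(?alias) = "{language}")
  }}
}}
""".strip()


# 175 is the CID shown for acetate in the two-column output above.
print(build_alias_query(175))
```

The printed query can be pasted into the Wikidata Query Service, where the `wdt:`, `rdfs:` and `skos:` prefixes are predefined.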

petermr commented 4 years ago

Discussed with @ambarishK today. Searching PubChem or ChEBI or even Wikidata gives far too many synonyms that will never be used in phytochemical papers.

So: (for Ambarish) collect the (approx) 400 synonyms we found in E1.0 and add them. Let's then see how many new compounds we get in (say) 1000 papers.

We don't have a very clear idea of the "accepted" name (there isn't such a thing in chemistry, unless they are drugs, etc.). I suggest we use the primary name in Wikipedia. Thus: "Acetic acid /əˈsiːtɪk/, systematically named ethanoic acid /ˌɛθəˈnoʊɪk/, is a ..." and the page is https://en.wikipedia.org/wiki/Acetic_acid, so we should call "Acetic acid" the preferred name and "ethanoic acid" a synonym. All occurrences of "ethanoic acid" should be used for searching, but hits should be routed to "acetic acid". With plants, the preferred name should come from a GBIF lookup.
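
The routing described here (search on every synonym, report the preferred name) could be sketched as follows. The `rows` data and the `find_compounds` helper are illustrative assumptions; the EIDs are taken from the sample rows in this thread.

```python
import re

# (EID, synonym, is_preferred) rows; the preferred flags follow the
# acetic acid example above and are otherwise assumptions.
rows = [
    ("C215", "acetic acid", True),
    ("C215", "ethanoic acid", False),
    ("C2776", "acetaldehyde", True),
    ("C2776", "ethanal", False),
]

preferred = {eid: name for eid, name, is_pref in rows if is_pref}
name_to_eid = {name.lower(): eid for eid, name, _ in rows}

# Longest names first, so a longer synonym wins over a shorter one it contains.
names = sorted(name_to_eid, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
    re.IGNORECASE,
)


def find_compounds(text):
    """Return (matched text, EID, preferred name) for every synonym hit."""
    hits = []
    for match in pattern.finditer(text):
        eid = name_to_eid[match.group(1).lower()]
        hits.append((match.group(1), eid, preferred[eid]))
    return hits


print(find_compounds("The oil contained ethanoic acid and traces of ethanal."))
```

Because every hit carries the EID, counts per compound stay correct no matter which synonym a paper happens to use.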

ambarishK commented 4 years ago

Sir, please go through the compound synonym names extracted from EssoilDB 1.0:

uniqueCompSynonym20190910.tsv

Each synonym is reported as one record per row.

The column description is as follows.

Total number of records: 2812

Example -

CSID   EID     synonym
CS1    C214    acetate
CS2    C215    acetic acid
CS3    C2776   acetaldehyde
CS4    C170    3-hydroxy-2-butanone
CS5    C170    3-hydroxybutan-2-one
CS6    C170    acetoin
CS7    C2780   acetone
CS8    C298    benzaldehyde
CS9    C3196   benzene
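
A table in this shape can be loaded and grouped by EID with a few lines of Python. The sketch below uses an inline sample instead of the real file and assumes tab-separated columns with a `CSID` / `EID` / `synonym` header row; `load_synonyms` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Inline sample standing in for uniqueCompSynonym20190910.tsv; the header
# row and tab separator are assumptions about that file's layout.
sample_tsv = (
    "CSID\tEID\tsynonym\n"
    "CS1\tC214\tacetate\n"
    "CS2\tC215\tacetic acid\n"
    "CS4\tC170\t3-hydroxy-2-butanone\n"
    "CS5\tC170\t3-hydroxybutan-2-one\n"
    "CS6\tC170\tacetoin\n"
)


def load_synonyms(handle):
    """Group synonyms by EID, rejecting any repeated CSID."""
    by_eid = defaultdict(list)
    seen_csids = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["CSID"] in seen_csids:
            raise ValueError(f"duplicate CSID: {row['CSID']}")
        seen_csids.add(row["CSID"])
        by_eid[row["EID"]].append(row["synonym"].strip())
    return dict(by_eid)


by_eid = load_synonyms(io.StringIO(sample_tsv))
print(by_eid["C170"])
```

For the real file, the `io.StringIO` wrapper would be replaced by an opened file handle, e.g. `open(path, newline="", encoding="utf-8")`.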