Closed petermr closed 4 years ago
OK sir.
So, first we should retrieve all synonyms for the present 2100 compounds of EssoilDB from PubChem or Wikidata.
Then assign a unique EID to each synonym (one record per row in the synonym table).
Sir,
How do I separate the synonym values into different rows?
e.g.:
C214 | acetate\|Acetate Ion\|Acetic acid, ion(1-)\|Acetate ions\|71-50-1\|monoacetate\|MeCO2 anion\|UNII-569DQM74SC\|ethanoate\|Acetat\|569DQM74SC\|Ethanoat\|Shotgun\|CHEMBL1354\|racemic acetate\|Azetat\|acetyl hydroxide\|Acetic cid glacial\|TCLP extraction fluid 2\|AC1Q1J2O\|AC1Q1J9D\|AC1L18N9\|CH3-COO(-)\|DTXSID1037694\|CTK5D4394\|CHEBI:30089\|QTBSBXVTEAMEQO-UHFFFAOYSA-M\|STL282721\|CMC_13391\|BDBM50159793\|AKOS022101130\|AN-25008\|AN-23801\|LS-189936\|1395-EP2380874A2\|1395-EP2380661A2\|1395-EP2316837A1\|1395-EP2316833A1\|1395-EP2316827A1\|1395-EP23
C215 | acetic acid\|ethanoic acid\|64-19-7\|Ethylic acid\|Acetic acid, glacial\|Acetic acid glacial\|Methanecarboxylic acid\|Glacial acetic acid\|Vinegar acid\|Acetasol\|Acide acetique\|Essigsaeure\|Aci-jel\|Azijnzuur\|Vinegar\|Kyselina octova\|Acido acetico\|Octowy kwas\|Pyroligneous acid\|HOAc\|Azijnzuur [Dutch]\|Ethanoic acid monomer\|acetyl alcohol\|Aceticum acidum\|Essigsaeure [German]\|ethoic acid\|Caswell No. 003\|Otic Tridesilon\|Octowy kwas [Polish]\|Otic Domeboro\|Acetic acid (natural)\|Kyselina octova [Czech]\|Acide acetique [French]\|
C2776 | acetaldehyde\|ethanal\|acetic aldehyde\|ethyl aldehyde\|75-07-0\|Acetaldehyd\|Acetylaldehyde\|aldehyde\|Octowy aldehyd\|Acetic ethanol\|Aldeide acetica\|Aldehyde acetique\|acetaldehydes\|Azetaldehyd\|RCRA waste number U001\|Acetaldehyde (natural)\|NSC 7594\|NCI-C56326\|Acetaldehyd [German]\|ACETYL GROUP\|ethaldehyde\|CCRIS 1396\|HSDB 230\|Octowy aldehyd [Polish]\|UNII-GO1N1ZPR3B\|Aldeide acetica [Italian]\|Aldehyde acetique [French]\|ethanone\|UN1089\|MFCD00006991\|FEMA No. 2003\|CHEBI:15343\|AI3-31167\|CH3CHO\|EINECS 200-836-8\|GO1N1ZPR3B\|R
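One way to turn rows like these into one synonym per row can be sketched as follows. This is only a minimal illustration, assuming each line has the form `EID | syn1\|syn2\|...` as in the listing above; the function name `split_synonyms` is hypothetical.

```python
import re

def split_synonyms(line):
    """Split 'EID | syn1\\|syn2\\|...' into a list of (eid, synonym) pairs."""
    # The first " | " separates the identifier from the packed synonyms.
    eid, _, packed = line.partition(" | ")
    # Accept both the escaped form "\|" seen in the listing and a bare "|".
    parts = re.split(r"\\?\|", packed)
    return [(eid.strip(), p.strip()) for p in parts if p.strip()]

rows = split_synonyms(r"C215 | acetic acid\|ethanoic acid\|64-19-7")
for eid, synonym in rows:
    print(eid, synonym, sep="\t")
```

Note that a line truncated in the middle of a synonym would still yield that partial last entry, so truncated input should be checked before it is loaded.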
Sir, it got resolved. The above output is from the PubChem web API. The PubChem Identifier Exchange Service writes its results as two columns, one row per input-output correspondence:
175 acetate
175 Acetate Ion
175 Acetic acid, ion(1-)
175 Acetate ions
175 71-50-1
175 monoacetate
175 MeCO2 anion
175 UNII-569DQM74SC
175 ethanoate
175 Acetat
The next step is to map them to the unique EID.
No, one row per synonym
On Mon, Sep 2, 2019 at 1:16 PM Ambarish Kumar notifications@github.com wrote:
Sir,
How do I separate the synonym values into different rows?
e.g.:
C214 | acetate | preferred
C214 | Acetic acid
C214 | Acetate Ion
C215 | acetic acid | preferred
C215 | ethanoic acid
C2776 | acetaldehyde | preferred
C2776 | ethanal
The preferred name should be the common name in Wikidata or Wikipedia. This may require my judgment.
P.
Sir, the output of the PubChem Identifier Exchange Service has one synonym per row (two columns, one row for each input-output correspondence).
Sir, please go through the compound synonym table.
Compound synonyms are retrieved from the PubChem Identifier Exchange Service.
Steps to get the compound synonyms are as follows.
- Select the "synonyms" option in the Output IDs drop-down list.
- Select the "Two column file showing each input-output correspondence" option in Output Method.
- Select the "No compression" option in the Compression drop-down list.
Steps to map the compound synonyms to the EssoilDB unique identifier:
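> # Load the CID-to-EID mapping and the CID-to-synonym table
> CIDEID <- read.csv("C:/Users/AMBARISH/Documents/cid_eid.csv")
> CIDSYN <- read.csv("C:/Users/AMBARISH/Documents/cid_syn.csv")
> # Join on the shared "cid" column; all.y = TRUE keeps every synonym row
> CIDEIDSYN <- merge(CIDEID, CIDSYN, by = "cid", all.y = TRUE)
> View(CIDEIDSYN)
> write.csv(CIDEIDSYN, "C:/Users/AMBARISH/Documents/compoundSynonymTable.csv")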
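For readers without R, the same join can be sketched with Python's standard library. This is an illustrative sketch only: the column names `eid` and `synonym` and the sample rows are assumptions (only the `cid` column name appears in the R call), and it assumes each CID maps to at most one EID.

```python
import csv
import io

def merge_eid_synonyms(cid_eid_rows, cid_syn_rows):
    """Attach an EID to every (cid, synonym) row; keep rows that have no EID."""
    eid_by_cid = {row["cid"]: row["eid"] for row in cid_eid_rows}
    merged = []
    for row in cid_syn_rows:
        merged.append({
            "cid": row["cid"],
            # Empty string when the CID is unmatched (R would show NA here).
            "eid": eid_by_cid.get(row["cid"], ""),
            "synonym": row["synonym"],
        })
    return merged

# Tiny in-memory stand-ins for cid_eid.csv and cid_syn.csv (hypothetical contents).
cid_eid = list(csv.DictReader(io.StringIO("cid,eid\n175,C214\n176,C215\n")))
cid_syn = list(csv.DictReader(io.StringIO(
    "cid,synonym\n175,acetate\n175,ethanoate\n177,ethanal\n")))
merged = merge_eid_synonyms(cid_eid, cid_syn)
for row in merged:
    print(row["cid"], row["eid"], row["synonym"], sep="\t")
```

Keeping the unmatched rows mirrors `all.y = TRUE`, which makes synonyms without an EID visible instead of silently dropping them.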
Thanks very much. Clearly PubChem has too many synonyms (trade names, etc.): about 50 per compound. We can remove some of them with regexes. We'll keep this, but maybe Wikidata is better. Can you try that? It's a difficult problem...
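A rough sketch of what regex-based pruning might look like is given below. The patterns are assumptions chosen to match identifier-style entries visible in the PubChem output above (CAS numbers, UNII, ChEBI, InChIKeys, language-tagged duplicates); they are not exhaustive and would need tuning against the real synonym table.

```python
import re

# Hypothetical patterns for registry-style identifiers; tune against real data.
NOISE_PATTERNS = [
    # CAS registry number, e.g. 64-19-7
    re.compile(r"^\d{2,7}-\d{2}-\d$"),
    # Database-prefixed identifiers, e.g. UNII-569DQM74SC, CHEBI:30089, NSC 7594
    re.compile(r"^(UNII|CHEBI|CHEMBL|DTXSID|NSC|EINECS|MFCD|AKOS|BDBM|HSDB|CCRIS)"
               r"[-: ]?\w", re.IGNORECASE),
    # InChIKey, e.g. QTBSBXVTEAMEQO-UHFFFAOYSA-M
    re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$"),
    # Language-tagged duplicate, e.g. "Azijnzuur [Dutch]"
    re.compile(r"\[[A-Z][a-z]+\]$"),
]

def is_noise(synonym):
    """True if the synonym looks like an identifier rather than a name."""
    return any(p.search(synonym) for p in NOISE_PATTERNS)

def filter_synonyms(synonyms):
    """Keep only the entries that no noise pattern matches."""
    return [s for s in synonyms if not is_noise(s)]

sample = ["acetate", "71-50-1", "UNII-569DQM74SC", "ethanoate", "CHEBI:30089",
          "QTBSBXVTEAMEQO-UHFFFAOYSA-M", "Azijnzuur [Dutch]"]
kept = filter_synonyms(sample)
print(kept)
```

Trade names and vendor catalogue codes would pass through this filter unchanged, so it only reduces the problem.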
Yes sir.
Sir, what is the Wikidata property for chemical compound synonyms? (to be used in a SPARQL query)
Good question. I don't know. I'll think and ask about it. I don't think it's top priority yet.
On Thu, Sep 5, 2019 at 3:34 PM Ambarish Kumar notifications@github.com wrote:
Sir, what is the Wikidata property for chemical compound synonyms? (to be used in a SPARQL query)
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
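On the SPARQL question: as far as I know, Wikidata does not store synonyms under a dedicated property. An item's aliases ("also known as") are exposed as `skos:altLabel`, and items can be looked up by PubChem CID through property P662. A hedged sketch of building such a query follows; the helper name `build_synonym_query` is hypothetical, and the query relies on the prefixes predefined by the Wikidata Query Service.

```python
def build_synonym_query(pubchem_cid, language="en"):
    """Return a SPARQL query listing the label and aliases for one PubChem CID."""
    # wdt:P662 is the PubChem CID property; skos:altLabel holds the aliases.
    return f"""
SELECT ?item ?label ?alias WHERE {{
  ?item wdt:P662 "{pubchem_cid}" .
  ?item rdfs:label ?label . FILTER(LANG(?label) = "{language}")
  OPTIONAL {{ ?item skos:altLabel ?alias . FILTER(LANG(?alias) = "{language}") }}
}}
""".strip()

query = build_synonym_query(176)
print(query)
```

The printed query can be pasted into the Wikidata Query Service to check what it returns for a given compound.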
Discussed with @ambarishK today. Searching PubChem or ChEBI or even Wikidata gives far too many synonyms that will never be used in phytochemical papers.
So: (for Ambarish) collect the (approx) 400 synonyms we found in E1.0 and add them. Let's then see how many new compounds we get in (say) 1000 papers.
We don't have a very clear idea of the "accepted" name (there isn't such a thing in chemistry, unless they are drugs, etc.). Suggest we use the primary name in Wikipedia. Thus: "Acetic acid /əˈsiːtɪk/, systematically named ethanoic acid /ˌɛθəˈnoʊɪk/, is a ..." and the page is: https://en.wikipedia.org/wiki/Acetic_acid so we should call "Acetic acid" the preferred name and "ethanoic acid" a synonym. All references to "ethanoic acid" should be used for searching but hits should be routed to "acetic acid". With plants the preferred name should come from GBIF lookup.
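The routing idea can be sketched as a simple lookup: every search term maps to a preferred name, and hits are counted under that name. The table contents and the function names below are hypothetical, for illustration only.

```python
# Hypothetical synonym table: lower-cased search term -> preferred name.
PREFERRED = {
    "acetic acid": "acetic acid",
    "ethanoic acid": "acetic acid",
    "acetaldehyde": "acetaldehyde",
    "ethanal": "acetaldehyde",
}

def route_hit(term):
    """Return the preferred name for a matched term, or None if unknown."""
    return PREFERRED.get(term.strip().lower())

def count_hits(terms):
    """Count matched terms under their preferred names, ignoring unknown terms."""
    counts = {}
    for term in terms:
        name = route_hit(term)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return counts

hits = count_hits(["Ethanoic acid", "acetic acid", "ethanal", "water"])
print(hits)
```

Lower-casing the term before lookup is a design choice here so that "Ethanoic acid" and "ethanoic acid" resolve to the same entry.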
Sir, please go through the compound synonym names extracted from EssoilDB 1.0.
Each synonym is reported as one record per row.
The columns are as follows.
CSID - Compound synonym ID.
EID - Unique identifier assigned to each compound.
synonym - Compound synonym name.
Total number of records: 2812
Example -
CSID EID synonym
CS1 C214 acetate
CS2 C215 acetic acid
CS3 C2776 acetaldehyde
CS4 C170 3-hydroxy-2-butanone
CS5 C170 3-hydroxybutan-2-one
CS6 C170 acetoin
CS7 C2780 acetone
CS8 C298 benzaldehyde
CS9 C3196 benzene
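A table in this shape can be loaded into two lookups, synonym to EID and EID to synonyms. The sketch below assumes the columns are separated by whitespace and that only the synonym column may contain spaces; `load_synonym_table` is a hypothetical helper name.

```python
from collections import defaultdict

TABLE = """CSID EID synonym
CS4 C170 3-hydroxy-2-butanone
CS5 C170 3-hydroxybutan-2-one
CS6 C170 acetoin
CS7 C2780 acetone"""

def load_synonym_table(text):
    """Return (synonym -> EID, EID -> [synonyms]) from whitespace-separated rows."""
    eid_by_synonym = {}
    synonyms_by_eid = defaultdict(list)
    for line in text.splitlines()[1:]:            # skip the header row
        csid, eid, synonym = line.split(None, 2)  # synonym may contain spaces
        eid_by_synonym[synonym.lower()] = eid
        synonyms_by_eid[eid].append(synonym)
    return eid_by_synonym, dict(synonyms_by_eid)

eid_by_synonym, synonyms_by_eid = load_synonym_table(TABLE)
print(eid_by_synonym["acetoin"])
print(synonyms_by_eid["C170"])
```

With both lookups available, any name found in a paper resolves to one EID, and each EID can list every name it is known by.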
We have ca. 2100 compounds in EssoilDB 1.0. The critical information is:
The problem is that a compound occurs under many synonyms (e.g.
anisole
will be retrieved by the current dictionary.
We need a simple synonym table, one row per synonym.
All names are therefore resolved through a unique EID.
Synonyms can be retrieved from PubChem or Wikidata.