petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

compound synonyms and stereochemistry #54

Open petermr opened 4 years ago

petermr commented 4 years ago

The compound names in table columns are frequently ambiguous. The first table is https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/thyme.tsv

| Compound | Compound_dictionary_lookpup | E2.0_compound_identifiers | notes | wikidata_identifier |
|---|---|---|---|---|
| alpha-Thujene | (-)-alpha-thujene ; (+)-alpha-thujene | C764 ; C786 | stereo-isomers of the compounds are there. | Q27121815 ; Q27121804 |
| alpha-Pinene | alpha-Pinene | C2849 | Also, stereo-isomers of the compounds are there. | Q27104380 |
| beta-Pinene | beta-Pinene | C349 | Also, stereo-isomers of the compounds are there. | |
| beta-Myrcene | beta-Myrcene | C345 | | Q424577 |
| alpha-Phellandrene | alpha-Phellandrene | C2848 | | Q19606345 |
| Carene<δ-2-> | 2-carene | C1720 | Lookup is of '2-carene' | |
| D-Limonene | (+)-limonene | C792 | | Q27888324 |
| beta-Phellandrene | beta-Phellandrene | C3426 | | Q19606727 |
| para-Cymene | cymene | C4118 | Other cymene are present as 'm-cymenene', 'dehydro-p-cymene', 'o-cymene', | Q284072 |
| gamma-Terpinene | beta-terpinene | C355 | Present as beta-terpinene | Q23057921 |
| Terpineol | 1-terpineol | C1482 | | Q27276701 |
| Terpinen-4-ol | (+)-terpinen-4-ol | C795 | | Q27280168 |
| Thymol | | | not present. | |
| Caryophyllene | (z)-caryophyllene ; 9-epi-(E)-caryophyllene ; alpha-caryophyllene | C1255 ; C2705 ; C2915 | Stereo-isomers are present | NA ; Q27137093 ; Q1995108 |

petermr commented 4 years ago

implementing

Will start by creating a bag of unknown terms.
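
As a rough illustration of what such a bag could look like, here is a minimal Python sketch. It assumes each table row is reduced to a (compound, lookup) pair and treats a compound as "unknown" when its dictionary lookup is empty. The function name `bag_of_unknown_terms` and the lowercasing step are assumptions made for this example, not part of the CEVOpen code.

```python
from collections import Counter

def bag_of_unknown_terms(rows):
    """Count compound names whose dictionary lookup is empty.

    `rows` is an iterable of (compound, lookup) pairs; this shape is
    an assumption for the sketch, not the project's actual data model.
    """
    bag = Counter()
    for compound, lookup in rows:
        if not lookup.strip():
            # Lowercase so that 'Thymol' and 'thymol' count as one term.
            bag[compound.strip().lower()] += 1
    return bag

# Three rows taken from the thyme table above.
rows = [
    ("beta-Myrcene", "beta-Myrcene"),
    ("Thymol", ""),
    ("Terpineol", "1-terpineol"),
]
unknown = bag_of_unknown_terms(rows)
```

With these rows only Thymol ends up in the bag, because it is the only compound without a lookup.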

petermr commented 4 years ago

analysing isomerism and synonyms

We need to sort compounds by WikidataID and PubchemCID to determine synonyms. Example:

para-cymen-7-ol             325 4-Isopropylbenzyl alcohol   
p-cymen-7-ol    p-cymen-7-ol                325 4-Isopropylbenzyl alcohol   

These two entries have the same PubChem CID (325), so they should be grouped together. PMR will then decide which name is the best to keep.

cuminaldehyde   cuminaldehyde   cuminaldehyde   Q419952     326 4-Isopropylbenzaldehyde 
cuminal cuminal cuminaldehyde   Q419952     326 4-Isopropylbenzaldehyde 
octanal 

has both Wikidata and Pubchem
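
The grouping described in the examples above could be sketched roughly as follows. This is only an illustration: the record keys `name`, `wikidata` and `pubchem` and the helper `group_by_id` are hypothetical, and the records are typed in by hand from the four example rows rather than read from the real TSV.

```python
from collections import defaultdict

def group_by_id(records, key):
    """Group compound names that share the same identifier.

    Records with an empty identifier are skipped; only identifiers
    shared by two or more names (candidate synonyms) are returned.
    """
    groups = defaultdict(list)
    for rec in records:
        ident = rec.get(key)
        if ident:
            groups[ident].append(rec["name"])
    return {ident: names for ident, names in groups.items() if len(names) > 1}

# Hand-copied from the example rows; the field names are assumptions.
records = [
    {"name": "para-cymen-7-ol", "wikidata": "", "pubchem": "325"},
    {"name": "p-cymen-7-ol", "wikidata": "", "pubchem": "325"},
    {"name": "cuminaldehyde", "wikidata": "Q419952", "pubchem": "326"},
    {"name": "cuminal", "wikidata": "Q419952", "pubchem": "326"},
]
by_cid = group_by_id(records, "pubchem")
by_qid = group_by_id(records, "wikidata")
```

Grouping on the PubChem CID catches both synonym pairs, while grouping on the Wikidata ID catches only the cuminaldehyde pair, since the cymen-7-ol rows carry no Wikidata ID. That is one reason to run both passes.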

sort TSV file by WikidataID and remove synonyms

@ambarishK will sort the table in a spreadsheet on the WikidataID column: notFoundWIKIDATASortedPubChem.tsv. PMR will then edit this file manually.

sort TSV file by PubchemCID and remove synonyms

@ambarishK will sort the table in a spreadsheet on the PubChemID column: notFoundWIKIDATAPubChemSorted.tsv. PMR will then edit this file manually.

The recommitted files will normalize each compound to a single reference for Wikidata and a single reference for PubChem. PMR will then resolve any remaining conflicts and fuzzy matches.
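
If the spreadsheet step ever needs to be repeated, the sort could also be scripted. The following is a sketch under stated assumptions, not the workflow actually used here: the column names `WikidataID` and `PubChemCID` and the helper `sort_tsv` are hypothetical, and the sample text stands in for the real files.

```python
import csv
import io

def sort_tsv(text, column):
    """Return TSV text with rows sorted on `column`.

    Rows with an empty value in `column` are placed last. The sort is
    stable, so rows sharing an ID keep their original relative order.
    IDs are compared as strings, not as numbers.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    rows.sort(key=lambda r: (r[column] == "", r[column]))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Small stand-in for the real file; the header names are assumptions.
sample = (
    "name\tWikidataID\tPubChemCID\n"
    "cuminal\tQ419952\t326\n"
    "p-cymen-7-ol\t\t325\n"
    "cuminaldehyde\tQ419952\t326\n"
    "para-cymen-7-ol\t\t325\n"
)
sorted_text = sort_tsv(sample, "WikidataID")
```

After sorting, rows with the same ID sit next to each other, which is what makes the manual synonym edit practical.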