petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Consolidating dictionaries #64

Closed petermr closed 4 years ago

petermr commented 4 years ago

The dictionary compilation and construction is evolving and needs further cleaning and aggregation. The history is messy. However the goal is relatively simple:

If the compound is not an EO component, that's a small false positive and probably does little harm. If there are false negatives, they will need to be added later, and this could be slightly messy. At this stage we are still working out the strategy for synonyms and fuzziness, so we may have to refactor later.

The sources are:

EssoilDB1.0

This contains about 8000 unique names, but many are typos. We cleaned this and created about 2100 unique names (but some may be synonyms).

oil186

This is 186 OA articles on EOs. Most have a composition table. We extracted the names from the composition tables , selecting

petermr commented 4 years ago

current files

My current analysis is:

EssoilDB1.0

https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/essoil10.xml 2116 compound names from the E1.0 database, cleaned in places and linked to Wikidata where possible. A few false positives such as

<entry name="nd" term="nd"/>

but generally these are real compounds with Wikidata IDs. There is no history as to how they got into E1.0 or what their frequency is.
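Placeholder entries such as `nd` could be removed with a small filter. The sketch below is only an illustration: the `<dictionary>` root element, the assumption that `<entry>` elements are direct children of the root, and the `PLACEHOLDERS` stoplist are all assumptions, and `nd` is the only false positive actually named here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stoplist; "nd" is the only false positive named in this thread.
PLACEHOLDERS = {"nd"}

# Tiny stand-in for essoil10.xml; the <dictionary> root is an assumption.
SAMPLE = """<dictionary title="essoil10">
  <entry name="nd" term="nd"/>
  <entry name="limonene" term="limonene"/>
</dictionary>"""


def drop_placeholders(xml_text, placeholders=PLACEHOLDERS):
    """Parse dictionary XML and remove entries whose term is a placeholder."""
    root = ET.fromstring(xml_text)
    removed = []
    # Copy the child list so that removing entries does not disturb iteration.
    for entry in list(root.findall("entry")):
        term = (entry.get("term") or "").strip().lower()
        if term in placeholders:
            root.remove(entry)
            removed.append(term)
    return root, removed


cleaned_root, removed_terms = drop_placeholders(SAMPLE)
print("removed:", removed_terms)
print("kept:", [e.get("term") for e in cleaned_root.findall("entry")])
```

Returning the removed terms alongside the cleaned tree keeps a record of what was dropped, which may help given that E1.0 has no history of its own.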

These names occur in EO literature.

compound.xml

https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/compound.xml This seems to be identical with essoil10.xml
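Whether the two files really are identical could be confirmed with a byte-level comparison on local copies. This is a minimal sketch using the standard library; the temporary files below only stand in for checked-out copies of `essoil10.xml` and `compound.xml`, and `same_content` is a hypothetical helper name.

```python
import filecmp
import os
import tempfile


def same_content(path_a, path_b):
    """Return True if the two files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of trusting os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demonstration on two throwaway files standing in for the real dictionaries.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "essoil10.xml")
    b = os.path.join(tmp, "compound.xml")
    for path in (a, b):
        with open(path, "w", encoding="utf-8") as fh:
            fh.write('<entry name="nd" term="nd"/>\n')
    identical = same_content(a, b)

print("identical:", identical)
```

A byte comparison is stricter than an XML comparison: files that differ only in whitespace or attribute order would be reported as different.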

raw

https://github.com/petermr/CEVOpen/tree/master/dictionary/compound/raw This directory contains trial attempts to resolve synonyms, etc. It includes:

compoundSynonym

https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/compoundSynonym.tsv synonyms added from Wikidata/PubChem. Most do not occur in EO literature and tend to pollute the dictionary.

compound1

https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/compound1.xml

Probably obsolete.

compoundSynonymTable

https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/compoundSynonymTable.tsv

a huge list of synonyms, now deleted.

uniqueCompSynonym20190910

https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/uniqueCompSynonym20190910.tsv

petermr commented 4 years ago

oil186 and oil1000 data

These are most of the compounds in the tables in oil186. They were aggregated into a multiset which was then used to populate a new dictionary.

oil186 multiset

https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/compound_multiset_raw.txt

starts:

 x 66
α-Pinene x 28
Limonene x 28
β-Pinene x 26
Caryophyllene oxide x 25
Linalool x 24
γ-Terpinene x 23
Camphene x 23
Camphor x 21
p-Cymene x 19
Sabinene x 19
α-Terpineol x 18
Spathulenol x 18
1,8-Cineole x 17
Unidentified x 17
δ-Cadinene x 17
α-Copaene x 16
Borneol x 16
Bornyl acetate x 16
α-Humulene x 16
β-Caryophyllene x 16
α-Terpinene x 15
Myrcene x 15
α-Phellandrene x 15
Terpinolene x 14
α-Thujene x 14
Germacrene D x 14
Terpinen-4-ol x 13
Total x 13
β-pinene x 12

Note that the multiset is case-sensitive, so we have both β-Pinene x 26 and β-pinene x 12. There are also random white-space variants.
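One possible way to merge the case and white-space variants is to normalise each name before counting. The sketch below assumes the `name x N` line format shown in the listing, treats lines without an `x N` suffix as singletons, and uses hypothetical names (`LINE_RE`, `merge_multiset`); it is not the code that produced the file.

```python
import re
from collections import Counter

# "name x N"; the name may be empty, as in the first line of the listing.
LINE_RE = re.compile(r"^(?P<name>.*?)\s+x\s+(?P<count>\d+)\s*$")


def merge_multiset(lines):
    """Merge counts for names that differ only in case or white space."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            name, n = match.group("name"), int(match.group("count"))
        else:
            # Singletons carry no "x N" suffix in the multiset file.
            name, n = line, 1
        # Collapse runs of white space and lower-case the name.
        key = " ".join(name.split()).lower()
        if key:  # skip the blank name
            counts[key] += n
    return counts


SAMPLE = [
    " x 66",
    "α-Pinene x 28",
    "β-Pinene x 26",
    "β-pinene x 12",
    "Camphora .",
]
merged = merge_multiset(SAMPLE)
for name, n in merged.most_common():
    print(name, "x", n)
```

With this normalisation the two β-pinene lines collapse into one entry with a count of 38. Lower-casing loses the original capitalisation, so a real pipeline might want to keep the most frequent surface form as the display name.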

There's a total of 1737 records. Many of the singletons are rubbish (numbers, fragments, etc.), so we keep only those with at least 2 occurrences.

lines 381-391:

β-Gurjunene x 2
cis-β-Guaiene x 2
Toluene x 2
α-terpinyl acetate x 2
<<start of singletons>>
(Z), (E)- α-farnesene     << included space
Camphora .                     << appended superscript 
cis-Chrysanthenyl acetate
Octilin
1,8-cineole (20.1), α-thujone (25.1), β-thujone (22.9), camphor (10.5) . << unparsed list
allo-Ocimene

That gives 385 probable names and 1350 singletons.
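The split into probable names and singletons could be sketched as a simple frequency threshold. The numeric filter is an added assumption (the thread only says that numbers and fragments occur among the rubbish), and `split_by_frequency`, `is_number` and the sample counts are hypothetical.

```python
import re

# Matches bare numbers such as "2300" or "98.06".
NUMBER_RE = re.compile(r"\d+(\.\d+)?")


def is_number(name):
    """True if the name is nothing but an integer or decimal number."""
    return NUMBER_RE.fullmatch(name.strip()) is not None


def split_by_frequency(counts, min_count=2):
    """Split a name -> count mapping into probable names and singletons."""
    probable = {}
    singletons = []
    for name, n in counts.items():
        if is_number(name):
            continue  # numeric fragment, not a compound name
        if n >= min_count:
            probable[name] = n
        else:
            singletons.append(name)
    return probable, sorted(singletons)


# A few entries taken from the listing above, plus one numeric fragment.
SAMPLE = {
    "Toluene": 2,
    "β-Gurjunene": 2,
    "Octilin": 1,
    "allo-Ocimene": 1,
    "98.06": 1,
}
probable, singletons = split_by_frequency(SAMPLE)
print("probable:", probable)
print("singletons:", singletons)
```

Keeping the singletons in a separate list rather than discarding them leaves room to recover false negatives later.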

oil1000 multiset

A larger but less analyzed set of 1000 articles:

https://github.com/petermr/CEVOpen/blob/master/searches/oil1000/__tables/compound_multiset1.txt

starts:

Limonene x 108
α-Pinene x 93
Linalool x 91
Caryophyllene oxide x 87
β-Pinene x 85
Camphene x 79
Sabinene x 72
γ-Terpinene x 69
β-Caryophyllene x 67
Spathulenol x 64
p-Cymene x 62

and the singletons start at line 972:

γ-Amorphene x 2
Hungary x 2
Protein x 2
8-Heptadecene x 2
<<singletons>>
Acetovanillone
E-Pinocarveol
98.06
Geranialdehyde
(Z), (E)- α-farnesene
2-Ethyl-1-hexyl acetate
Methyl hexadecanoate
2300

So we took 975 terms and used these as the basis of a new dictionary.
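Turning a list of terms into dictionary XML could look roughly like the sketch below. It reuses the `<entry name="..." term="..."/>` shape shown earlier; the `<dictionary title="...">` root, the lower-cased `term` attribute and the function name `build_dictionary` are assumptions rather than the project's actual code.

```python
import xml.etree.ElementTree as ET


def build_dictionary(terms, title):
    """Build a <dictionary> element with one <entry> per distinct term."""
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for raw in terms:
        name = " ".join(raw.split())  # collapse stray white space
        key = name.lower()
        if not key or key in seen:
            continue  # skip blanks and case variants already added
        seen.add(key)
        ET.SubElement(root, "entry", {"name": name, "term": key})
    return root


# The first few terms of the oil1000 multiset, plus one case variant.
new_dict = build_dictionary(
    ["Limonene", "α-Pinene", "Linalool", "limonene"], "oil1000"
)
print(ET.tostring(new_dict, encoding="unicode"))
```

Entries are added in input order, so feeding the terms in descending frequency would keep the most common compounds at the top of the file.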

lookup oil186 terms in e1.0 dictionary

1297 terms from oil186 (not sure how selected) were used to search the compound.xml dictionary.

This gave https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/foundNotFound.txt which shows terms found and not found:

Cannot find term in dictionary (+)-cedrol
Cannot find term in dictionary (+)-curcuphenol
Cannot find term in dictionary (+)-fenchol
Cannot find term in dictionary (+)-α-terpineol
Cannot find term in dictionary (+/-)-norephedrine
...
<985>
Cannot find term in dictionary λ-gurjunene
Cannot find term in dictionary ρ-cymene
Cannot find term in dictionary ρ-cymenea,b
Cannot find term in dictionary τ-cadinol
Cannot find term in dictionary τ-muurolol
<the following were in E1.0 and where possible have Wikidata Ids>
found: (-)-caryophyllene oxide
found: (-)-limonene
found: (e)-2-hexenal
found: (e)-2-nonenal
found: (e)-2-octenal
<995>
...
<1292>
found: viridiflorol
found: vulgarol b
found: vulgarone b
found: yomogi alcohol
found: zingiberene
found: zonarene
<1297>

So the first chunk (the terms not found) is subjected to further lookup against Wikidata and PubChem.
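A lookup of this kind could be reproduced along the following lines. The sketch assumes that matching is done on lower-cased `term` attributes (the found terms in the listing are lower-case) and that the dictionary root is `<dictionary>`; `load_terms` and `lookup` are hypothetical names, and the output merely imitates the format of foundNotFound.txt.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for compound.xml; the <dictionary> root is an assumption.
SAMPLE_DICT = """<dictionary title="compound">
  <entry name="limonene" term="limonene"/>
  <entry name="zingiberene" term="zingiberene"/>
</dictionary>"""


def load_terms(xml_text):
    """Collect the lower-cased term attribute of every <entry>."""
    root = ET.fromstring(xml_text)
    terms = {(e.get("term") or "").strip().lower() for e in root.iter("entry")}
    terms.discard("")
    return terms


def lookup(candidates, dictionary_terms):
    """Return (found, not_found) lists of normalised candidate terms."""
    normalised = {c.strip().lower() for c in candidates if c.strip()}
    found, not_found = [], []
    for term in sorted(normalised):
        (found if term in dictionary_terms else not_found).append(term)
    return found, not_found


found, not_found = lookup(
    ["(+)-cedrol", "Zingiberene", "limonene"], load_terms(SAMPLE_DICT)
)
for term in not_found:
    print("Cannot find term in dictionary", term)
for term in found:
    print("found:", term)
```

Exact matching like this will miss spelling variants such as ρ-cymene for p-cymene, which is presumably why so many terms land in the not-found chunk.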

petermr commented 4 years ago

lookup notFound in Wikidata and Pubchem

see https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/notFoundCompWIKIDATAPubChem.tsv

Largely through lookup (partially automated) @ambarishK gets:

| Compound | WIKIDATA_query_id | notes | PubChem_CID | PubChem_cmpdname |
| butanoic acid | butyric acid (Q193213) | | 264 | Butyric acid |
| cumarin | coumarin (Q111812) : CumarinLow German | | 323 | Coumarin |
| para-cymen-7-ol | | | 325 | 4-Isopropylbenzyl alcohol |
| p-cymen-7-ol | | | 325 | 4-Isopropylbenzyl alcohol |
| cuminaldehyde | cuminaldehyde (Q419952) | | 326 | 4-Isopropylbenzaldehyde |
| cuminal | cuminaldehyde (Q419952) : cuminal | | 326 | 4-Isopropylbenzaldehyde |
| octanal | caprylaldehyde (Q416673) : n-octanal | | 454 | Octanal |

... (980 lines)

NOTE: notes is occasionally used. The wikidata column needs separation of name and id.
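Separating the name from the id in the Wikidata column could be done with a regular expression. The sketch assumes the cell format seen in the rows above, `label (Qnnn)` optionally followed by ` : alias`; `WIKIDATA_RE` and `split_wikidata` are hypothetical names.

```python
import re

# label (Q123) optionally followed by " : alias"
WIKIDATA_RE = re.compile(
    r"^\s*(?P<label>.*?)\s*\((?P<qid>Q\d+)\)\s*(?::\s*(?P<alias>.*?))?\s*$"
)


def split_wikidata(cell):
    """Split a Wikidata cell into (label, qid, alias), or None if no id."""
    match = WIKIDATA_RE.match(cell)
    if match is None:
        return None  # empty cell or no Q-id present
    return match.group("label"), match.group("qid"), match.group("alias") or ""


for cell in [
    "butyric acid (Q193213)",
    "caprylaldehyde (Q416673) : n-octanal",
    "",
]:
    print(repr(cell), "->", split_wikidata(cell))
```

Cells without a Q-id, such as the empty ones for p-cymen-7-ol, return `None` so they can be flagged for manual lookup rather than silently dropped.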

This will be the basis of the next version of the dictionary.

EmanuelFaria commented 4 years ago

@petermr two questions for you re: Dictionaries.md

1) Re: description text for Microorganisms dictionary... should I omit Viruses from the definition since we’re not focusing on that right now, or leave it in and see what the data delivers?

[In this version of the Dictionary?] Microorganisms are the bacteria, fungi, yeasts and molds, protozoa, algae, or viruses upon which the experiments were conducted to determine what effect (Activities) EOs may have on them.

2) Owing to this article I recently read, it may be important that we distinguish Gram-positive and Gram-negative wherever possible as a separate field/column. Pos vs Neg may be an important search term in and of itself.

It is particularly worrying, says the WHO, that there are no new drugs imminent against gram-negative bacteria, which can cause pneumonia, bloodstream infections, wound or surgical site infections and meningitis.

Of the 50 antibiotics in the pipeline, 32 target pathogens listed by the WHO in 2017 as a global priority. But most of the drugs have only limited benefits when compared with existing antibiotics. Only two are active against the multi-drug resistant, gram-negative bacteria, which, says the WHO, are spreading rapidly and require urgent solutions.

Gram-negative bacteria, such as Klebsiella pneumoniae and Escherichia coli, can cause severe and often deadly infections. They are a particular threat for people with weak or under-developed immune systems, including newborn babies, ageing populations, and people undergoing surgery and cancer treatment.

petermr commented 4 years ago

On Wed, Jan 22, 2020 at 3:39 PM Emanuel Faria notifications@github.com wrote:

@petermr https://github.com/petermr two questions for you re: Dictionaries.md https://github.com/petermr/CEVOpen/blob/master/BJOC/Dictionaries.md

  1. Re: description text for Microorganisms dictionary... should I omit Viruses from the definition since we’re not focusing on that right now, or leave it in and see what the data delivers?

If the papers mention viruses we keep them in. The only reason for exclusion is the long-tail - singleton mentions.

[In this version of the Dictionary?] Microorganisms are the bacteria, fungi, yeasts and molds, protozoa, algae, or viruses upon which the experiments were conducted to determine what effect (Activities) EOs may have on them.

They are the organisms used as targets. More generally we should use "target species". The only reason for using "microorganisms" is that the article uses that term. But they should be "medical" - so I think mosquitos, helminths, etc are worth keeping, but not targets of pheromones.

  2. Owing to this article https://www.theguardian.com/business/2020/jan/17/big-pharma-failing-to-invest-in-new-antibiotics-says-who I recently read, it may be important that we distinguish Gram-positive and Gram-negative wherever possible as a separate field/column. Pos vs Neg may be an important search term in and of itself.

No. The dictionaries link to Wikipedia.

It is particularly worrying, says the WHO, that there are no new drugs imminent against gram-negative bacteria, which can cause pneumonia, bloodstream infections, wound or surgical site infections and meningitis.

Of the 50 antibiotics in the pipeline, 32 target pathogens listed by the WHO in 2017 as a global priority. But most of the drugs have only limited benefits when compared with existing antibiotics. Only two are active against the multi-drug resistant, gram-negative bacteria, which, says the WHO, are spreading rapidly and require urgent solutions.

Gram-negative bacteria, such as Klebsiella pneumoniae and Escherichia coli, can cause severe and often deadly infections. They are a particular threat for people with weak or under-developed immune systems, including newborn babies, ageing populations, and people undergoing surgery and cancer treatment.

This may be useful when we write the intro, but not now.

The exercise is to summarise what we have got.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

EmanuelFaria commented 4 years ago

They are the organisms used as targets. More generally we should use "target species". The only reason for using "microorganisms" is that the article uses that term. But they should be "medical" - so I think mosquitos, helminths, etc are worth keeping, but not targets of pheromones.

Changes made. Thanks.

EmanuelFaria commented 4 years ago

@petermr , can we now close this issue?