Closed petermr closed 4 years ago
My current analysis is:
https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/essoil10.xml 2116 compound names from the E1.0 database, cleaned in places and linked to Wikidata where possible. A few false positives such as
<entry name="nd" term="nd"/>
but generally these are real compounds with Wikidata IDs. There is no history as to how they got into E1.0 or what their frequency is.
These names occur in EO literature.
https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/compound.xml
This seems to be identical with essoil10.xml
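One way to confirm this rather than eyeball it is to compare the sets of terms in the two files. The sketch below is a minimal illustration, not project code: it assumes entries look like the `<entry name="..." term="..."/>` line quoted above, and the `<dictionary>` root element and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def entry_terms(xml_text):
    """Return the set of `term` attribute values from all <entry> elements."""
    root = ET.fromstring(xml_text)
    return {e.get("term") for e in root.iter("entry") if e.get("term") is not None}


def compare_dictionaries(xml_a, xml_b):
    """Return (terms only in A, terms only in B); two empty sets mean the same term set."""
    a, b = entry_terms(xml_a), entry_terms(xml_b)
    return a - b, b - a


# Tiny inline stand-ins for essoil10.xml and compound.xml.
# The <dictionary> root element is an assumption for illustration.
essoil10 = ('<dictionary><entry name="limonene" term="limonene"/>'
            '<entry name="nd" term="nd"/></dictionary>')
compound = ('<dictionary><entry name="nd" term="nd"/>'
            '<entry name="limonene" term="limonene"/></dictionary>')

only_in_essoil10, only_in_compound = compare_dictionaries(essoil10, compound)
print(only_in_essoil10, only_in_compound)
```

Comparing term sets ignores entry order and formatting, so it answers "same content?" rather than "same bytes?".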
https://github.com/petermr/CEVOpen/tree/master/dictionary/compound/raw This directory contains trial attempts to resolve synonyms, etc. It includes:
https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/compoundSynonym.tsv synonyms added from Wikidata/Pubchem - most do not occur in EO literature and tend to pollute the dictionary.
https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/compound1.xml
Probably obsolete.
https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/compoundSynonymTable.tsv
a huge list of synonyms, now deleted.
https://github.com/petermr/CEVOpen/blob/master/dictionary/compound/raw/uniqueCompSynonym20190910.tsv
These are most of the compounds in the tables in oil186. They were aggregated into a multiset, which was then used to populate a new dictionary.
https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/compound_multiset_raw.txt
starts:
x 66
α-Pinene x 28
Limonene x 28
β-Pinene x 26
Caryophyllene oxide x 25
Linalool x 24
γ-Terpinene x 23
Camphene x 23
Camphor x 21
p-Cymene x 19
Sabinene x 19
α-Terpineol x 18
Spathulenol x 18
1,8-Cineole x 17
Unidentified x 17
δ-Cadinene x 17
α-Copaene x 16
Borneol x 16
Bornyl acetate x 16
α-Humulene x 16
β-Caryophyllene x 16
α-Terpinene x 15
Myrcene x 15
α-Phellandrene x 15
Terpinolene x 14
α-Thujene x 14
Germacrene D x 14
Terpinen-4-ol x 13
Total x 13
β-pinene x 12
Note that the names are not case-normalized, so we have both β-Pinene x 26 and β-pinene x 12. There are also random white-space variants.
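A possible way to merge such variants is to normalize each name before counting. This is only a sketch under assumptions: the `name x N` line format is taken from the listing above, lines without a count are treated as singletons, and `normalize` and `merge_multiset` are hypothetical helper names, not part of the CEVOpen code.

```python
import re
import unicodedata
from collections import Counter

# "name x N"; the name part is optional because the listing has an entry with a blank name.
LINE_RE = re.compile(r"^(?:(?P<name>.*?)\s+)?x\s+(?P<count>\d+)\s*$")


def normalize(name):
    """Case-fold and collapse white-space so trivially different spellings merge."""
    name = unicodedata.normalize("NFC", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name.casefold()


def merge_multiset(lines):
    """Re-aggregate multiset lines after normalizing the compound names."""
    merged = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            name, count = m.group("name") or "", int(m.group("count"))
        else:
            name, count = line, 1  # singleton lines carry no "x N" suffix
        key = normalize(name)
        if key:  # skip blank names
            merged[key] += count
    return merged


sample = ["β-Pinene x 26", "β-pinene x 12", "Limonene x 28", "allo-Ocimene"]
print(merge_multiset(sample).most_common())
```

Case-folding loses the original capitalization, so a real pipeline would probably keep one display form per normalized key.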
There's a total of 1737 records. Many of the singletons are rubbish (numbers, fragments, etc.), so we take those with at least 2 occurrences.
lines 381-391:
β-Gurjunene x 2
cis-β-Guaiene x 2
Toluene x 2
α-terpinyl acetate x 2
<<start of singletons>>
(Z), (E)- α-farnesene << included space
Camphora . << appended superscript
cis-Chrysanthenyl acetate
Octilin
1,8-cineole (20.1), α-thujone (25.1), β-thujone (22.9), camphor (10.5) . << unparsed list
allo-Ocimene
That gives 385 probable names and 1350 singletons.
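The cut at two occurrences can be sketched as below. The threshold comes from the text; the junk heuristic (pure numbers, very short fragments) is an illustrative assumption and not the rule actually used, and both function names are hypothetical.

```python
from collections import Counter


def split_by_frequency(counts, min_count=2):
    """Split a multiset into (probable names, singletons) by occurrence count."""
    kept = {name: n for name, n in counts.items() if n >= min_count}
    rare = {name: n for name, n in counts.items() if n < min_count}
    return kept, rare


def looks_like_junk(name):
    """Assumed heuristic: pure numbers or very short fragments are unlikely to be names."""
    stripped = name.strip()
    return len(stripped) < 3 or stripped.replace(".", "", 1).isdigit()


counts = Counter({"Toluene": 2, "Octilin": 1, "98.06": 1, "2300": 1, "allo-Ocimene": 1})
probable, singletons = split_by_frequency(counts)
worth_review = [name for name in singletons if not looks_like_junk(name)]
print(probable, worth_review)
```

Singletons such as allo-Ocimene are real compounds, so a review list of non-junk singletons may recover some of the false negatives.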
A larger but less analyzed set of 1000 articles:
https://github.com/petermr/CEVOpen/blob/master/searches/oil1000/__tables/compound_multiset1.txt
starts:
Limonene x 108
α-Pinene x 93
Linalool x 91
Caryophyllene oxide x 87
β-Pinene x 85
Camphene x 79
Sabinene x 72
γ-Terpinene x 69
β-Caryophyllene x 67
Spathulenol x 64
p-Cymene x 62
and starts singletons at 972:
γ-Amorphene x 2
Hungary x 2
Protein x 2
8-Heptadecene x 2
<<singletons>>
Acetovanillone
E-Pinocarveol
98.06
Geranialdehyde
(Z), (E)- α-farnesene
2-Ethyl-1-hexyl acetate
Methyl hexadecanoate
2300
So we took 975 terms and used these as the basis of a new dictionary.
1297 terms from oil186 (not sure how selected) were used to search the compound.xml dictionary.
This gave https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/foundNotFound.txt which shows terms found and not found:
Cannot find term in dictionary (+)-cedrol
Cannot find term in dictionary (+)-curcuphenol
Cannot find term in dictionary (+)-fenchol
Cannot find term in dictionary (+)-α-terpineol
Cannot find term in dictionary (+/-)-norephedrine
...
<985>
Cannot find term in dictionary λ-gurjunene
Cannot find term in dictionary ρ-cymene
Cannot find term in dictionary ρ-cymenea,b
Cannot find term in dictionary τ-cadinol
Cannot find term in dictionary τ-muurolol
<the following were in E1.0 and where possible have Wikidata Ids>
found: (-)-caryophyllene oxide
found: (-)-limonene
found: (e)-2-hexenal
found: (e)-2-nonenal
found: (e)-2-octenal
<995>
...
<1292>
found: viridiflorol
found: vulgarol b
found: vulgarone b
found: yomogi alcohol
found: zingiberene
found: zonarene
<1297>
So the first chunk is subjected to further lookup against Wikidata and Pubchem.
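The found/not-found split above could be reproduced roughly as follows. This is a sketch, not the code that produced foundNotFound.txt: it assumes matching on the lower-cased `term` attribute of `<entry>` elements, and the `<dictionary>` root and function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_terms(xml_text):
    """Collect lower-cased `term` attributes from all <entry> elements."""
    root = ET.fromstring(xml_text)
    return {e.get("term", "").strip().lower() for e in root.iter("entry")}


def report(terms, dictionary_terms):
    """Return report lines: not-found terms first, then found terms, each sorted."""
    not_found, found = [], []
    for term in sorted({t.strip().lower() for t in terms}):
        (found if term in dictionary_terms else not_found).append(term)
    return ([f"Cannot find term in dictionary {t}" for t in not_found]
            + [f"found: {t}" for t in found])


# Inline stand-in for compound.xml (root element name is an assumption).
dictionary_xml = ('<dictionary><entry name="zingiberene" term="zingiberene"/>'
                  '<entry name="(-)-limonene" term="(-)-limonene"/></dictionary>')
for line in report(["(+)-cedrol", "Zingiberene", "(-)-Limonene"],
                   load_terms(dictionary_xml)):
    print(line)
```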
notFound in Wikidata and Pubchem
Largely through lookup (partially automated), @ambarishK gets:
| Compound | WIKIDATA_query_id | notes | PubChem_CID | PubChem_cmpdname |
|---|---|---|---|---|
| butanoic acid | butyric acid (Q193213) | | 264 | Butyric acid |
| cumarin | coumarin (Q111812) : Cumarin Low German | | 323 | Coumarin |
| para-cymen-7-ol | | | 325 | 4-Isopropylbenzyl alcohol |
| p-cymen-7-ol | | | 325 | 4-Isopropylbenzyl alcohol |
| cuminaldehyde | cuminaldehyde (Q419952) | | 326 | 4-Isopropylbenzaldehyde |
| cuminal | cuminaldehyde (Q419952) : cuminal | | 326 | 4-Isopropylbenzaldehyde |
| octanal | caprylaldehyde (Q416673) : n-octanal | | 454 | Octanal |
... (980 lines)
NOTE: the notes column is occasionally used.
The Wikidata column needs separation of name and id.
This will be the basis of the next version of the dictionary.
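Separating the name and id in the Wikidata column might look like the sketch below. The `label (Qnnn) : alias` cell shape is inferred from the sample rows above; the regular expression and the `split_wikidata_cell` helper are hypothetical and would need checking against all 980 lines.

```python
import re

# label, then "(Q<digits>)", then an optional " : matched alias".
WIKIDATA_CELL = re.compile(
    r"^\s*(?P<label>.*?)\s*\((?P<qid>Q\d+)\)\s*(?::\s*(?P<alias>.*\S))?\s*$"
)


def split_wikidata_cell(cell):
    """Split a cell into label, Q-id and alias; return None for empty or unparsable cells."""
    m = WIKIDATA_CELL.match(cell)
    if not m:
        return None
    return {"label": m.group("label"), "qid": m.group("qid"), "alias": m.group("alias")}


for cell in ["butyric acid (Q193213)", "cuminaldehyde (Q419952) : cuminal", ""]:
    print(repr(cell), "->", split_wikidata_cell(cell))
```

Returning `None` for cells that do not parse keeps rows without a Wikidata match (such as p-cymen-7-ol) distinguishable from rows with one.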
@petermr two questions for you re: Dictionaries.md
1) Re: description text for Microorganisms dictionary... should I omit Viruses from the definition since we’re not focusing on that right now, or leave it in and see what the data delivers?
[In this version of the Dictionary?] Microorganisms are the bacteria, fungi, yeasts and molds, protozoa, algae, or viruses upon which the experiments were conducted to determine what effect (Activities) EOs may have on them.
2) Owing to this article I recently read, it may be important that we distinguish Gram-positive and Gram-negative wherever possible as a separate field/column. Pos vs Neg may be an important search term in and of itself.
It is particularly worrying, says the WHO, that there are no new drugs imminent against gram-negative bacteria, which can cause pneumonia, bloodstream infections, wound or surgical site infections and meningitis.
Of the 50 antibiotics in the pipeline, 32 target pathogens listed by the WHO in 2017 as a global priority. But most of the drugs have only limited benefits when compared with existing antibiotics. Only two are active against the multi-drug resistant, gram-negative bacteria, which, says the WHO, are spreading rapidly and require urgent solutions.
Gram-negative bacteria, such as Klebsiella pneumoniae and Escherichia coli, can cause severe and often deadly infections. They are a particular threat for people with weak or under-developed immune systems, including newborn babies, ageing populations, and people undergoing surgery and cancer treatment.
On Wed, Jan 22, 2020 at 3:39 PM Emanuel Faria notifications@github.com wrote:
@petermr https://github.com/petermr two questions for you re: Dictionaries.md https://github.com/petermr/CEVOpen/blob/master/BJOC/Dictionaries.md
- Re: description text for Microorganisms dictionary... should I omit Viruses from the definition since we’re not focusing on that right now, or leave it in and see what the data delivers?
If the papers mention viruses we keep them in. The only reason for exclusion is the long-tail - singleton mentions.
[In this version of the Dictionary?] Microorganisms are the bacteria, fungi, yeasts and molds, protozoa, algae, or viruses upon which the experiments were conducted to determine what effect (Activities) EOs may have on them.
They are the organisms used as targets. More generally we should use "target species". The only reason for using "microorganisms" is that the article uses that term. But they should be "medical" - so I think mosquitos, helminths, etc are worth keeping, but not targets of pheromones.
- Owing to this article https://www.theguardian.com/business/2020/jan/17/big-pharma-failing-to-invest-in-new-antibiotics-says-who I recently read, it may be important that we distinguish Gram-positive and Gram-negative wherever possible as a separate field/column. Pos vs Neg may be an important search term in and of itself.
No. The dictionaries link to Wikipedia.
It is particularly worrying, says the WHO, that there are no new drugs imminent against gram-negative bacteria, which can cause pneumonia, bloodstream infections, wound or surgical site infections and meningitis.
Of the 50 antibiotics in the pipeline, 32 target pathogens listed by the WHO in 2017 as a global priority. But most of the drugs have only limited benefits when compared with existing antibiotics. Only two are active against the multi-drug resistant, gram-negative bacteria, which, says the WHO, are spreading rapidly and require urgent solutions.
Gram-negative bacteria, such as Klebsiella pneumoniae and Escherichia coli, can cause severe and often deadly infections. They are a particular threat for people with weak or under-developed immune systems, including newborn babies, ageing populations, and people undergoing surgery and cancer treatment.
This may be useful when we write the intro, but not now.
The exercise is to summarise what we have got.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
They are the organisms used as targets. More generally we should use "target species". The only reason for using "microorganisms" is that the article uses that term. But they should be "medical" - so I think mosquitos, helminths, etc are worth keeping, but not targets of pheromones.
Changes made. Thanks.
@petermr , can we now close this issue?
The dictionary compilation and construction is evolving and needs further cleaning and aggregation. The history is messy. However the goal is relatively simple:
If the compound is not an EO component, that's a small false positive and probably does little harm. If there are false negatives, they will need to be added later, and this could be slightly messy. At this stage we are still working out the strategy for synonyms and fuzziness, so we may have to refactor later.
The sources are:
EssoilDB1.0
This contains about 8000 unique names, but many are typos. We cleaned this and created about 2100 unique names (but some may be synonyms).
oil186
This is 186 OA articles on EOs. Most have a composition table. We extracted the names from the composition tables, selecting