petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

📕 Documentation: Dictionary.xml and DictionaryDescription.md of: eoPlantMaterialHistory #81

Open EmanuelFaria opened 4 years ago

EmanuelFaria commented 4 years ago

Here we describe the process of creating a [DictionaryName]DictionaryDescription.md document, within which we will describe the contents of the individual dictionary (named in the title of this Issue), which was created (or is in the process of being created) from data collected for Oil186.

I will begin this thread by pasting the contents of the INDEX description, followed by a first-draft copy below for discussion and direction.

EmanuelFaria commented 4 years ago

Plant Growth/Collection/Processing Methods

I don’t know how best to name or describe this dictionary. It shares entries with other dictionaries.

 

ProcessDictionaryDescription.md

EmanuelFaria commented 4 years ago

Process Dictionary

 

A dictionary of [XX] plant processes from which Essential Oils — mentioned in the 186 test articles downloaded from PubMed — were harvested.

 

File Data

 

Table Column Headings

 

Contents/Results

 

Notes:

EmanuelFaria commented 4 years ago

Since my last post, the following has been accomplished:

  1. The title of this Dictionary and its description document has been changed to Plant Material History

  2. To better identify and organize the data, I have created two new columns in the table (bulleted below), and categorized the terms accordingly (note that these could also be useful as dependent drop-down lists in a database):

    • PlantHistoryCat1: the MAIN category of differentiation of the types of data related to the Plant Material History

    • PlantHistoryCat2: the SUB category of differentiation of the types of data related to the Plant Material History

  3. There is still some doubt/ambiguity to clear up about some of the WikidataID numbers. To be able to discuss this efficiently, I have created the following new columns:

    • /link/@wikidata: hyperlink to a Wikidata page or search results for the item in question

    • /wikiIDconfidence: my colour-coded "confidence rating" (Low=red, Medium=yellow, High=green) on how well the WikidataID matches the entry term

    • /desc/@wikidata: in some places, I have pasted in the description supplied on the corresponding Wikidata page

    • /desc/@wikipedia: in some cases, there was no matching term in Wikidata, but there was in Wikipedia. I have copied some of the Wikipedia descriptions for the term here.

    • Notes: where I have listed some questions to discuss with @petermr

   PDF and xlsx documents attached here for reference and discussion: PlantMaterialHistory20200202.pdf, PlantMaterialHistory20200202.xlsx
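As a rough illustration of how the category and confidence columns could drive a review pass, here is a minimal Python sketch. The class and field names, the placeholder IDs, and the second and third rows are assumptions made up for the example; they are not taken from the spreadsheet.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    """Colour-coded confidence that a Wikidata ID matches the term."""
    LOW = "red"
    MEDIUM = "yellow"
    HIGH = "green"


@dataclass
class Entry:
    term: str
    cat1: str                  # PlantHistoryCat1 (main category)
    cat2: str                  # PlantHistoryCat2 (sub category)
    wikidata_id: str = ""      # empty when no ID has been found
    confidence: Confidence = Confidence.LOW
    notes: str = ""


def needs_review(entries):
    """Return entries whose Wikidata match is missing or below HIGH confidence."""
    return [e for e in entries
            if not e.wikidata_id or e.confidence is not Confidence.HIGH]


# Hypothetical rows for illustration only; the IDs are placeholders.
rows = [
    Entry("winter", "Growth Conditions", "Season", "Q_EXAMPLE_1", Confidence.HIGH),
    Entry("shade dried", "Drying", "Method", "Q_EXAMPLE_2", Confidence.MEDIUM),
    Entry("conventionally distilled oil", "Extraction", "Method"),
]

for entry in needs_review(rows):
    print(entry.term, entry.confidence.value)
```

The same two category fields could back dependent drop-down lists, since the valid Cat2 values depend on the chosen Cat1.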

Next, I will start a new comment to itemize the Questions/Problems to be resolved.

petermr commented 4 years ago

On Mon, Feb 3, 2020 at 3:53 PM Emanuel Faria notifications@github.com wrote:

Since my last post, the following has been accomplished:

  1. The title of this Dictionary and its description document has been changed to Plant Material History

Good. Please remove any spaces from this. Best is camelCase: plantMaterialHistory

  2. To better identify and organize the data, I have created two new columns in the table (bulleted below), and categorized the terms accordingly (note that these could also be useful as dependent drop-down lists in a database):

    • PlantHistoryCat1: the MAIN category of differentiation of the types of data related to the Plant Material History

    • PlantHistoryCat2: the SUB category of differentiation of the types of data related to the Plant Material History

  3. There is still some doubt/ambiguity to clear up about some of the WikidataID numbers. To be able to discuss this efficiently, I have created the following new columns:

    • /link/@wikidata: hyperlink to a Wikidata page or search results for the item in question

    • /wikiIDconfidence: my colour-coded "confidence rating" (Low=red, Medium=yellow, High=green) on how well the WikidataID matches the entry term

    • /desc/@wikidata: in some places, I have pasted in the description supplied on the corresponding Wikidata page

    • /desc/@wikipedia: in some cases, there was no matching term in Wikidata, but there was in Wikipedia. I have copied some of the Wikipedia descriptions for the term here.

    • Notes: where I have listed some questions to discuss with @petermr

Looks good - will discuss when we talk.

PDF and xlsx documents attached here for reference and discussion: PlantMaterialHistory20200202.pdf https://github.com/petermr/CEVOpen/files/4148738/PlantMaterialHistory20200202.pdf PlantMaterialHistory20200202.xlsx https://github.com/petermr/CEVOpen/files/4148739/PlantMaterialHistory20200202.xlsx


— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petermr/CEVOpen/issues/81?email_source=notifications&email_token=AAFTCS64QSBEXG3FMVQBP3TRBA4XTA5CNFSM4KMMCLDKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKULEVI#issuecomment-581481045, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTCS4O2MH3CFI42FZUGJLRBA4XTANCNFSM4KMMCLDA .

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

EmanuelFaria commented 4 years ago

Please remove any spaces from this. Best is camelCase: plantMaterialHistory

Done. Thanks

EmanuelFaria commented 4 years ago

Questions/Problems remaining to be resolved.

 

WikiDataID:

  1. Many wikidataIDs lead to a page with the exact term searched, but the page lacks any data whatsoever.

  2. Many terms have no exactly corresponding Wikidata page. Do we create one, even as a placeholder like those in item 1 above?

 

DAVE.[item].# IDs

  1. How will we use these, exactly?

  2. As I have added new items, will all of these need to be renumbered from top to bottom?

    1. If so, by which column(s) shall I sort them?

    2. Hand-serializing these IDs that contain words, dots, and numbers is time-consuming and can introduce errors. Would it not be better to just use serialized numbers? That way, a future database can begin from wherever we left off at any time.

    3. What happens when we add new entries in the future — for example, for Synonyms or for those related by Cat1 and Cat2 grouping?

      1. Should the new IDs follow the ones they’re most related to?

      2. Do we set up the IDs to include a “.” between category groupings to indicate Cat1 and Cat2 etc? (Example below)

DAVE.PlantMaterialHistory.GC.S.1   Growth Conditions   Season   winter
DAVE.PlantMaterialHistory.GC.S.2   Growth Conditions   Season   spring
DAVE.PlantMaterialHistory.GC.S.3   Growth Conditions   Season   summer
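To make the trade-off between the two numbering schemes concrete, here is a small Python sketch. The function names are hypothetical, and the "autumn" entry is an invented later addition; the GC and S codes and the first three terms come from the example above.

```python
from collections import defaultdict
from itertools import count


def hierarchical_ids(rows, dictionary="PlantMaterialHistory", prefix="DAVE"):
    """Number terms separately within each (Cat1 code, Cat2 code) group."""
    counters = defaultdict(lambda: count(1))
    ids = []
    for cat1_code, cat2_code, term in rows:
        n = next(counters[(cat1_code, cat2_code)])
        ids.append((f"{prefix}.{dictionary}.{cat1_code}.{cat2_code}.{n}", term))
    return ids


def serial_ids(rows, start=1):
    """Plain running numbers; a later batch continues from the last one used."""
    return [(n, term) for n, (_, _, term) in enumerate(rows, start=start)]


seasons = [("GC", "S", "winter"), ("GC", "S", "spring"), ("GC", "S", "summer")]

for dave_id, term in hierarchical_ids(seasons):
    print(dave_id, term)

# Adding an entry later: the hierarchical scheme appends within its group,
# the serial scheme simply continues from the last number used.
later = seasons + [("GC", "S", "autumn")]  # hypothetical new entry
print(hierarchical_ids(later)[-1])
print(serial_ids(later)[-1])
```

In both schemes existing IDs stay stable as long as new rows are appended rather than inserted and re-sorted, which is the main thing that avoids hand renumbering.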

What to do with Un-ID-able terms?

In the original table, there were some terms that we cannot reasonably expect to have Wikidata or Wikipedia entries (examples below). Should these be deleted, or “tagged” in some way as useful “semantic” phrases?

 

Low- and Medium-Confidence WikiIDs

One-by-one, I checked the WikiIds in the original table. While many were obviously correct, some were less so.

I have corrected and added all that I confidently could, but there are still others that pose problems such as:

  1. IDs that link to items that mention or resemble the term but are not specifically about it. Examples:

    1. Water distillation https://www.wikidata.org/wiki/Q274959

      1. This ID relates to distilled water, not the process of oil extraction via water distillation, for which no entry exists

    2. Hydrodistillation https://www.wikidata.org/wiki/Q64097733

      1. This ID leads to a page about an article entitled: Extraction of Essential Oils of L. by Two Different Methods: Hydrodistillation and Microwave Assisted Hydrodistillation. There is no page specifically for hydrodistillation

    3. Steam Distillation https://www.wikidata.org/wiki/Q1164392

      1. This ID leads to a page about steam distillation, but there is no other information on the page
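Checks like these could be partly automated. The sketch below assumes the public Wikidata `wbgetentities` API and flags an item as a likely stub when it has no description and no statements. The function names, the `is_stub` rule, and the sample response are illustrative assumptions, not part of the project's toolchain, and the sample is hand-written rather than a real API reply.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"


def entity_url(qid, lang="en"):
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions|claims",
        "languages": lang,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def fetch_entity(qid, lang="en"):
    """Download the entity JSON (needs network access)."""
    req = urllib.request.Request(
        entity_url(qid, lang),
        headers={"User-Agent": "dictionary-check-sketch/0.1"},  # placeholder agent
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize(response, qid, lang="en"):
    """Pull out label, description and statement count for manual review."""
    entity = response["entities"][qid]
    # Empty sections may come back as an empty list, so normalise to a dict.
    labels = entity.get("labels") or {}
    descriptions = entity.get("descriptions") or {}
    claims = entity.get("claims") or {}
    label = labels.get(lang, {}).get("value", "")
    description = descriptions.get(lang, {}).get("value", "")
    n_claims = sum(len(statements) for statements in claims.values())
    return {
        "label": label,
        "description": description,
        "claims": n_claims,
        "is_stub": not description and n_claims == 0,
    }


# Hand-written sample in the shape of an API response (not real data).
sample = {"entities": {"Q_SAMPLE": {
    "labels": {"en": {"language": "en", "value": "sample term"}},
    "descriptions": [],
    "claims": [],
}}}
print(summarize(sample, "Q_SAMPLE"))
```

For a live check, `summarize(fetch_entity("Q1164392"), "Q1164392")` would return the label and description to compare by eye against the dictionary term; the confidence rating itself still needs human judgement.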

 

Some disambiguation found here: https://www.researchgate.net/post/What_is_different_between_water_steam_distillation_and_steam_distillation_system

In water distillation or hydro distillation, elevated pressure is used with plants whose essential oils are difficult to extract at higher temperatures.

In steam distillation, plant material is placed into a steam distillation chamber. Steam is forced into the chamber with it. As the essential oil interacts with the steam, the steam flows into the chilled condensed chamber, turning back into a liquid, providing the essential oil.

Hydro distillation with a Clevenger trap is used for the extraction of volatile oil (essential oil), and steam distillation is used in industry for the isolation of volatile oil.

The advantage of steam distillation is that the plant material can be recovered after oil extraction for solvent extraction for the isolation of other non volatile compounds whereas in hydro distillation the plant material is continuously boiled and not possible to recover. For large scale distillation handling of water is also not convenient.  

Recovery of oil is higher in hydro distillation compared to steam distillation. 

 

Challenges choosing the “best” WikiID

 

Example:

 

“Missing” Terms

While trying to disambiguate terms for “Drying Methods”, I found there were many other drying methods that were not in our dictionary. I was tempted to add them in case that helps identify more instances in the literature. Or are we letting our code build the entirety of the Dictionary? What shall we do?

 

Example 1: this page on wikipedia :

https://en.wikipedia.org/wiki/Drying

In the most common case, a gas stream, e.g., air, applies the heat by convection and carries away the vapor as humidity (https://en.wikipedia.org/wiki/Humidity). Other possibilities are vacuum drying (https://en.wikipedia.org/wiki/Vacuum_drying), where heat is supplied by conduction (https://en.wikipedia.org/wiki/Heat_conduction) or radiation (https://en.wikipedia.org/wiki/Radiation) (or microwaves, https://en.wikipedia.org/wiki/Microwaves), while the vapor thus produced is removed by the vacuum (https://en.wikipedia.org/wiki/Vacuum) system. Another indirect technique is drum drying (https://en.wikipedia.org/wiki/Drum_drying) (used, for instance, for manufacturing potato flakes), where a heated surface is used to provide the energy, and aspirators draw the vapor outside the room. In contrast, the mechanical extraction of the solvent, e.g., water, by filtration (https://en.wikipedia.org/wiki/Filtration) or centrifugation (https://en.wikipedia.org/wiki/Centrifugation), is not considered "drying" but rather "draining".

 

Example 2: this one on researchgate.net https://www.researchgate.net/post/Which_drying_methods_are_practiced_to_dry_plant_biomass_of_spices_agricultural_horticultural_medicinal_and_aromatic_plants

Traditionally agricultural/horticultural crops, spices, medicinal and aromatic plants and other plant products are dried in shade or Sun. Subsequently hot-air oven drying, solar drier drying, cross-flow drying, through-flow drying, vacuum shelf drying etc. techniques have been employed. Recently microwave drying, freeze drying, infrared or inert gas drying and combo drying techniques have also been used. What other methods are in practice and what are their advantages and disadvantages?

 

How to Categorize and Subcategorize certain terms as pulled from the literature?

Example:

  1. "solvent extraction": is it an extraction technique, or does the word "solvent" make it an extraction component?

    • solvent extraction Q866399

 

  2. Is this an EO extraction technique or a plant extract?

    • "conventionally distilled oil" (no wikiID)

EmanuelFaria commented 4 years ago

Progress Update

I have just committed the finished (I hope!) dictionary: PlantMaterialHistory.xml

The good news

The slightly annoying news: @petermr, I can't get the dictionary to open in XML Notepad. An online syntax checker suggests that a hidden character is causing problems (see screenshot). I've spent more time fiddling with this than I did entering the data. Please take a look and, when you fix it, please let me know how you did it.

Thanks, @mannyrules

petermr commented 4 years ago

The dictionary is well thought out. I have made some stylistic changes, e.g. leading chars should be lowercase and reserved words should not have spaces. Your toolchain has made a complete mess of the file. In future you shouldn't use any tool for editing dictionaries unless we have jointly agreed it. I have edited out the null characters, spurious quotes, etc. The more slick a tool is, the more likely it is to introduce strange characters. My guess is that Excel was used at some stage. See if XMLNotepad can read, edit and save the current dictionary without corruption. It should be OK.
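For anyone hitting the same problem, a check along the following lines can locate hidden characters before opening the file in an editor. This is only a sketch of one possible approach, not a record of how the file was actually repaired; the function names and the one-line sample dictionary are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow (everything below U+0020
# except tab, line feed and carriage return).
ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_chars(text):
    """Return (line, column, code point) for each illegal control character."""
    hits = []
    # Split on "\n" only; str.splitlines() would also split on some of the
    # control characters we are looking for.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL.finditer(line):
            hits.append((lineno, match.start() + 1, f"U+{ord(match.group()):04X}"))
    return hits


def is_well_formed(text):
    """True if the text parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except (ET.ParseError, ValueError):
        return False


# Hypothetical one-line dictionary with a null character injected.
sample = '<dictionary title="plantMaterialHistory"><entry term="winter"/></dictionary>'
broken = sample.replace("winter", "win\x00ter")

print(find_illegal_chars(broken))
cleaned = ILLEGAL.sub("", broken)
print(is_well_formed(cleaned))
```

Stripping the characters only fixes this one class of corruption; spurious quotes or a wrong encoding would still need to be corrected by hand.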

P.

petermr commented 4 years ago

Your comments on resolving to Wikidata and adding concepts are good. However, I think we should leave it as it is, UNLESS you are able to find an authority (e.g. USDA) which already has a glossary. Not high priority. These terms will not map prettily to Wikidata. Some are fine. Some are very broad.

EmanuelFaria commented 4 years ago

Sounds good. Let's discuss on our next call.

petermr commented 4 years ago

am around before 1700 UTC

On Wed, Feb 5, 2020 at 7:20 PM Emanuel Faria notifications@github.com wrote:

Sounds good. Let's discuss on our next call.


petermr commented 4 years ago

You have nicely developed a simple faceted classification (https://en.wikipedia.org/wiki/Faceted_classification) for the dictionary. Categories may need relabelling.

On Tue, Feb 4, 2020 at 9:16 PM Emanuel Faria notifications@github.com wrote:

Progress Update

I have just committed the finished (I hope!) dictionary: PlantMaterialHistory.xml https://github.com/petermr/CEVOpen/blob/master/dictionary/plantmaterialhistory/plantmaterialhistory.xml

The good news

The slightly annoying news: @petermr https://github.com/petermr, I can't get the dictionary to open in XML Notepad. An online syntax checker suggests that a hidden character is causing problems (see screenshot https://www.dropbox.com/s/rzs1zwyu6r15c3a/Screenshot%202020-02-04%2018.03.50.png?dl=0). I've spent more time fiddling with this than I did entering the data. Please take a look and, when you fix it, please let me know how you did it.

Thanks, @mannyrules https://github.com/mannyrules


EmanuelFaria commented 4 years ago

Thanks, @petermr. I'm around today. Skype me when you get in.

petermr commented 4 years ago

maybe 30 min?

On Thu, Feb 6, 2020 at 1:03 PM Emanuel Faria notifications@github.com wrote:

Thanks thanks @petermr https://github.com/petermr. I'm around today. Skype me when you get in.


EmanuelFaria commented 4 years ago

PlantMaterialHistory.xml and PlantMaterialHistoryDictionaryDescription.md are now updated and working. I have also updated master INDEXofOIL186Dictionaries.md

I would like to go back, however, and move the items having to do with distillation methods to a new separate dictionary. I will comment here again when that is done.

EmanuelFaria commented 4 years ago

@petermr Looking more closely at this dictionary, I think that besides separating out a new dictionary for distillation methods, we could also create a separate dictionary for plant growth stages. This may be overkill, but I just found this article entitled, "Whole-Plant Growth Stage Ontology for Angiosperms and Its Application in Plant Biology" http://www.plantphysiol.org/content/142/2/414 where (if I've read this right) they have identified 112 "active terms".

EmanuelFaria commented 4 years ago

@petermr Now that we have a stand-alone dictionary for EO Extraction Methods, I have deleted the ones that were in PlantMaterialHistory.xml, renumbered the DAVEids, and updated the PlantMaterialHistoryDictionaryDescription.md as well as the master Index

Here's the updated Dictionary Entry:

Extraction Method Dictionary

 

Description of Table and its Contents:

A dictionary of 73 terms for Essential Oil extraction methods.

 

File Data

https://github.com/petermr/CEVOpen/blob/master/dictionary/ExtractionMethod/ExtractionMethod.xml

 

 

Table Column Headings

 

Contents/Results

EmanuelFaria commented 4 years ago

As of today, I believe this dictionary and its description document are complete. Below I will copy the contents of the description document:

EO Plant Material History Dictionary

File Data

 

Table Column Headings

 

Contents/Results

 

Notes: