petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Mapping Tables of Essential Oil Activities mentioned in our test batch of articles #45

Open petermr opened 4 years ago

petermr commented 4 years ago

The activity references have been added manually to: https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/activity20191028.tsv Any article may have 0, 1, 2, 3... activities (not normally more). For each activity there should be:

The activity table should list all triples for each paper. If the mentions and the tables are inconsistent, note what has been omitted or duplicated.

The first few rows are:

| PMCID | activity | activity_method | activity_result | table_no | table_title | notes |
|---|---|---|---|---|---|---|
| PMC4391421 | anti-microbial | Materials and methods >> Determination of antimicrobial activity | Results and Discussion | Table 2 | Effects of thyme oil against bacteria expressed by the mean ... | [notes] |
| PMC5080681 | antibacterial | Methods >> Antimicrobial tests: | Results>> Antibacterial and antifungal activities | Table 4 | Antimicrobial activities of T. bovei essential oil | [notes] |
| PMC5080681 | antifungal | Methods >> Antimicrobial tests: | Results>> Antibacterial and antifungal activities | [no] | [table_title] | [notes] |
| PMC5080681 | Anthelmintic | Methods >> Anthelmintic activity | Results>> Anthelmintic activity | Table 2 | Anthelmintic activity of T. bovei essential oil | [notes] |
| PMC5080681 | Antioxidant activity | Methods >> DPPH radical-scavenging activity | Results >> Antioxidant activity : | Table 3 | Percentage inhibition of DPPH activity by T. bovei extract a ... | [notes] |
| PMC5080681 | antimicrobial activity | Methods >> Antimicrobial tests: | Results >> Antibacterial and antifungal activities | Table 4 | Antimicrobial activities of T. bovei essential oil | [notes] |

The title of the Table should match roughly with the measurement method and description of results.

This is messy because Tables may report more than one activity (as here).
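A minimal sketch of how the "more than one activity per table" case could be spotted automatically. It assumes the TSV's first column is headed PMCID and uses only three of the rows quoted above, cut down to the columns needed; the function names are illustrative and not part of the repository.

```python
import csv
import io
from collections import defaultdict

# Three rows taken from the excerpt above, reduced to the columns used here.
SAMPLE_TSV = (
    "PMCID\tactivity\ttable_no\n"
    "PMC5080681\tantibacterial\tTable 4\n"
    "PMC5080681\tAnthelmintic\tTable 2\n"
    "PMC5080681\tantimicrobial activity\tTable 4\n"
)


def activities_per_table(tsv_text):
    """Map (PMCID, table_no) to the list of activities logged against it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        grouped[(row["PMCID"], row["table_no"])].append(row["activity"])
    return dict(grouped)


def shared_tables(tsv_text):
    """Keep only the tables that carry more than one activity."""
    grouped = activities_per_table(tsv_text)
    return {key: acts for key, acts in grouped.items() if len(acts) > 1}


if __name__ == "__main__":
    for (pmcid, table_no), acts in shared_tables(SAMPLE_TSV).items():
        print(pmcid, table_no, acts)
```

Rows flagged this way are candidates for a note in the notes column rather than errors in themselves.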

ambarishK commented 4 years ago

Sir, please go through the composition table - composition20191028.tsv

185 articles analysed.

EmanuelFaria commented 4 years ago

Got it. I'll dig into the new tools from Jon to create some tags and test some annotations. Then I'll see if I can indeed export them and post them here (somewhere) for you to let me know if they are useable.

I will also see which tools (grep, easyfind, or even spotlight) could best be used to maximize accuracy and speed and show you what I come up with for your feedback on how best to proceed most efficiently.

(My time is limited today, but I intend to devote no less than 90 min to the above.)

Thanks for your guidance, Peter.

ambarishK commented 4 years ago

Sir, would you please brief me about annotations? Which sections do you want to annotate? Let me get some idea.

petermr commented 4 years ago

Ambarish, please concentrate completely on your assignment on composition tables. We will not be using annotation for compounds at this stage. I will be making the decisions about sections and will post information here as it is required.



-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

EmanuelFaria commented 4 years ago

Using GREP to search and log Essential Oil Activities mentioned in our test batch of articles

I am using GREP to update the table/log of activities (shown in the first entry of this issue) in our test batch of research articles currently in the Oil186 repository so that...

Goals: The Challenge, the solution we will bring, and the Desired End State by which all will know we have achieved excellence.

Steps to achieve the Goal(s):

  1. Because I don't know how to "code" GREP searches and cannot find an open source visual GREP tool, I am using one called VisualGrep, which I purchased on the AppStore for $2.00.
  2. I downloaded the contents of the OIL186 Repository to my mac.
  3. I have set VisualGrep to search the CONTENTS of documents only within the directory now on my mac, and to filter for file names that contain the word "scholarly" (i.e. files within subdirectories named "Scholarly.html").
  4. Because my aim is simply to create a table that lists even a SINGLE occurrence of an Activity term, NOT EVERY occurrence of the term, I set VisualGrep to stop at the first match for the search term. (In the example screenshot attached to this post, I am searching for the word "Antibacterial".)
  5. After executing the search, I sort by the "Content" column. This lets me see in which section of the article the first occurrence of the activity appears. If it is in the "Reference" section, it is not part of the study of the article, so I will not log it in the table.
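For anyone without VisualGrep, steps 3 and 4 can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the local directory name `oil186` is a guess at where the repository was downloaded, the function name is made up for illustration, and deciding whether a hit falls in the "Reference" section is still left to the reader.

```python
import re
from pathlib import Path


def first_matches(root, term, name_filter="scholarly"):
    """Yield (path, line_number, line) for the first hit of term in each file.

    Only files whose name contains name_filter are searched, and the search
    stops at the first matching line, mirroring the "stop at first match" setting.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or name_filter not in path.name.lower():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern.search(line):
                    yield path, number, line.strip()
                    break  # one hit per file is enough for the log


if __name__ == "__main__":
    # "oil186" is an assumed local path to the downloaded repository contents.
    for path, number, line in first_matches("oil186", "antibacterial"):
        print(f"{path}:{number}: {line[:80]}")
```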

Screenshot of my search results: VisualGREP search example

Desired Results: A clear and concise description / outline of the final "state or vision" of the project — the evidence we will see when our goals are achieved.

Excluding any activities found in articles under the heading “References”, I will log the following in each section of each article in the OIL186 repository:

I will also add any new activities, if any, to our Activities Dictionary.

EmanuelFaria commented 4 years ago

As described above, I have updated the table (now called OIL186-activity20191101.txt) to include the first occurrence only of any activities that were found in the activity dictionary — which I have updated (see ActivitiesNormalizedE1.020191101.txt) with newly found activities, as well as some notes to consider before we update the Activity Dictionary.

Both files are attached as tab-delimited txt files (I don't have the option to save as tsv)

petermr commented 4 years ago

A) Naming. There is no need to include OIL186 as it's already in the directory tree, so OIL186-activity20191101.txt => activity20191101.txt
B) This is a TSV file, so please rename to activity20191101.tsv (I have done so).
C) There are FAR too many rows and columns in this. I do not understand either. There should be about 300 rows (one for each REPORTED ACTIVITY MEASUREMENT TABLE). Please stick to the template I started.
D) Please keep rows in SORTED order (by PMCID).

The purpose of this is to be a gold standard for extracting tables of activity.
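Keeping rows sorted by PMCID is easy to do mechanically. The sketch below assumes the key column is headed PMCID and that every identifier has the form PMC followed by digits; the function name is illustrative only.

```python
import csv
import io


def sort_by_pmcid(tsv_text, key_column="PMCID"):
    """Return TSV text with the header first and data rows sorted by PMCID."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    index = header.index(key_column)
    # Sort on the numeric part so a shorter ID sorts before a longer one.
    # sorted() is stable, so rows for the same article keep their order.
    rows = sorted(
        (row for row in reader if row),  # skip blank lines
        key=lambda row: int(row[index].upper().replace("PMC", "")),
    )
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

The same function also gives a quick row count (`len(result.splitlines()) - 1`) to compare against the expected figure of about 300.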

Suggest we talk.

EmanuelFaria commented 4 years ago

Ohhhh.... this explains so much! 🤦🏻‍♂️

Oops. My bad.

I thought the tables product we last talked about was an entirely new task for me to complete.

What I did here was look for the occurrence of every activity we had listed in our dictionary, and the article it's found in. For the next column, I THOUGHT you wanted me to annotate each phrase that described the biological/chemical method by which that activity was enacted/completed (whatever).

Oh well. The good news is, at least the task I'm SUPPOSED to do is a much quicker task by comparison.

Let's get it sorted when we talk tomorrow, Nov 4, 2019.

Manny

EmanuelFaria commented 4 years ago

@petermr I'm having trouble figuring out how to map the table column headings as shown in the scholarly.html files in OIL186 searches to a single row in the spreadsheet template we worked on together.

Some of the issues I'm finding include:

  • Multiple Header rows
  • Some of the headings seem to have been merged fields covering two or more columns
  • Sometimes there is repetition of the column headings to cover different substances being tested

How do I handle things like the situation in the image below? We need to decide a rule to "mark down" some of these into something you can use.

For example, do I use:

Column1 = Microorganism (C. decurrens, C. sempervirens, T. articulata)

or

Column1 = Microorganism
Column2 = C. decurrens (MIC90, MBC)
Column3 = C. sempervirens (MIC90, MBC)
Column4 = T. articulata (MIC90, MBC)

or

Column1 = Microorganism
Column2 = MIC90 (C. decurrens)
Column3 = MBC (C. decurrens)
Column4 = MIC90 (C. sempervirens)
Column5 = MBC (C. sempervirens)
Column6 = MIC90 (T. articulata)
Column7 = MBC (T. articulata)
Column8 = Gentamycin Mean (µg/mL) ± Standard Deviation
Column9 = Gentamycin Mean (µg/mL) ± Standard Deviation

But then I still don't know how/where to describe Mean (µL/mL) ± Standard Deviation for you.

This example is for PMC5423258 Original article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5423258/
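The last of the options above (one column per measurement per oil) amounts to flattening a two-row header into single names. A rough sketch of that rule follows; the `(label, span)` representation of merged cells and the function name are assumptions made for illustration, and only the oil columns are shown because the sub-headings for the Gentamycin columns are not spelled out here.

```python
def flatten_headers(top_row, sub_row):
    """Combine a merged top header row with the sub-header row beneath it.

    top_row: list of (label, span) pairs, where span is how many columns the
             merged cell covers.
    sub_row: one label per leaf column; "" where the top cell has no sub-header.
    """
    expanded = []
    for label, span in top_row:
        expanded.extend([label] * span)  # repeat the merged label over its columns
    if len(expanded) != len(sub_row):
        raise ValueError("header rows cover different numbers of columns")
    return [
        f"{child} ({parent})" if child else parent
        for parent, child in zip(expanded, sub_row)
    ]


if __name__ == "__main__":
    top = [("Microorganism", 1), ("C. decurrens", 2),
           ("C. sempervirens", 2), ("T. articulata", 2)]
    sub = ["", "MIC90", "MBC", "MIC90", "MBC", "MIC90", "MBC"]
    for number, name in enumerate(flatten_headers(top, sub), start=1):
        print(f"Column{number} = {name}")
```

Units such as "Mean (µg/mL) ± Standard Deviation" could be carried the same way as a third header row, though that is a design choice still to be agreed.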

Screenshot 2019-11-05 12 41 27

petermr commented 4 years ago

talk in 30 mins


petermr commented 4 years ago

The tables with multiple headers are not always properly rendered. Am working on that.

I think the best thing is to collect tables into issues.

Skype?


EmanuelFaria commented 4 years ago

Hi Peter,

As I continue interpreting the tables into our "table formula/equation”, will you please double-check my “facts” below so I can keep going with confidence?

I'm looking at table 2 for https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5203915/ (see screenshot attached below)

Assuming we are counting the fungal and bacterial species being tested together, my guess is APA(O2C3A2S6P3) …

Otherwise, separating out the bacterial and fungal activities…

for bacteria, value for S would be 5 = APA(O2C3A2S5P3)
for fungus, value for S would be 1 = APA(O2C3A2S1P3)

By the way, Do you still want me to separate species types being tested in ONE table into SEPARATE tables?

All of this assumes I have the following correct, and haven’t left anything out:

O = Essential Oil(s) tested
C = Control(s) used (if any)
A = Activity(ies) tested
S = Species being tested
P = Parameters = number of measurement types
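To make the notation concrete, here is a small sketch that reads a formula such as APA(O2C3A2S6P3) back into named counts. The letter meanings come from the list above; the parser itself, the field names and the function name are assumptions for illustration, not an agreed format.

```python
import re

# Letter meanings as listed above.
FIELDS = {
    "O": "oils",
    "C": "controls",
    "A": "activities",
    "S": "species",
    "P": "parameters",
}


def parse_formula(text):
    """Split e.g. 'APA(O2C3A2S6P3)' into its layout tag and per-letter counts."""
    match = re.fullmatch(r"([A-Z]+)\(((?:[OCASP]\d+)+)\)", text.strip())
    if match is None:
        raise ValueError(f"not a table-description formula: {text!r}")
    layout, body = match.groups()
    counts = {
        FIELDS[letter]: int(number)
        for letter, number in re.findall(r"([OCASP])(\d+)", body)
    }
    return layout, counts


if __name__ == "__main__":
    print(parse_formula("APA(O2C3A2S6P3)"))
```

With this, the combined and separated readings can be compared directly: the species counts of the bacterial (S5) and fungal (S1) formulas add up to the S6 of the combined one.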

… Any corrections for me?

Thanks!

Manny

Screen Shot 2019-11-07 at 11 27 32 AM

EmanuelFaria commented 4 years ago

If I committed correctly, I just put my updates in Activity_Tables_Breakdown_2019-11-07.tsv under articleAnalysis/oil186/raw

EmanuelFaria commented 4 years ago

@petermr As I was generating "table-description formulas" (from which you will create regex/GREP search functions by which you will parse future tables into machine-readable data), I realized it may help to see images of similar tables side by side so that variations within them could more easily appear -- along with solutions to the regex challenges.

So here's what I've done:

  1. Took screenshots of all activity tables in articles PMC4391421 to PMC5622390 (more to come, if you find this useful).
  2. Named the table image files: "ArticleID_Tx". (T=Table, x= table number)
  3. Added a new directory here: https://github.com/petermr/CEVOpen/tree/master/articleAnalysis/oil186/raw/Example_Table_Images/
  4. Inside that directory, added the following sub-folders:
    • __table_images_to_Sort
    • APA
    • GRID
    • IRREGULAR
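The "ArticleID_Tx" naming makes the image files easy to index by script. A minimal sketch, assuming the ArticleID is a PMC identifier and that the file extension (for example .png) is not significant; the function name is hypothetical.

```python
import re
from pathlib import Path

# "ArticleID_Tx": PMC identifier, underscore, T, then the table number.
NAME_PATTERN = re.compile(r"(PMC\d+)_T(\d+)")


def parse_table_image_name(filename):
    """Return (article_id, table_number) for a name like 'PMC4391421_T2.png'."""
    match = NAME_PATTERN.fullmatch(Path(filename).stem)
    if match is None:
        raise ValueError(f"unexpected table image name: {filename!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    print(parse_table_image_name("PMC4391421_T2.png"))
```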

@petermr if you pull this directory down to your Mac, sort the images into the appropriate folders, we may save some time extracting meanings and methods from them.

After sorting, you might also choose to delete all that are redundant, and I/we can then focus on generating "table-description formulas" for the remainder.

I'm sure you could think of other possibilities too.

Please let me know what you think... and if I should proceed adding more screenshots for the rest of the oil186 articles.

Thanks!

Manny

petermr commented 4 years ago

Thanks. Sounds useful. I have fixed the bug in displaying HTML tables and will commit them.
