petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Simpler way of Isolating Tables for Extraction? #68

Closed EmanuelFaria closed 4 years ago

EmanuelFaria commented 4 years ago

I'm not certain, but it may be useful to extract tables from each article's unique table-only URL by replacing the article ID and table number in the NCBI URL format (as below).

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5497343/table/Tab3/

https://www.ncbi.nlm.nih.gov/pmc/articles/[ARTICLE_ID]/table/Tab[TABLE_X:X+1]/

What do you think @petermr ? Does this help in any way?

EmanuelFaria commented 4 years ago

Update: It doesn't work on every article. Example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788217/table/Tab2/ doesn't work.

petermr commented 4 years ago

Good idea, but as you note, it's not universal. It depends on the publisher: some tables are embedded in the article, some are separate, some are both.

Are you available in 1 hour? 1615 UTC/GMT? Like to discuss extracting activity tables.

P.

On Wed, Dec 11, 2019 at 2:55 PM Emanuel Faria notifications@github.com wrote:

Update: It doesn't work on every article. Example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788217/table/Tab2/ doesn't work.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

EmanuelFaria commented 4 years ago

I'm available. Call when ready.

EmanuelFaria commented 4 years ago

CORRECTION! @petermr There does seem to be a way to get to JUST the tables in any article — which MAY make data selection and extraction simpler.

For instance, the example I used above that DIDN'T follow the table-only URL structure I thought I'd discovered (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788217/table/Tab2/) DOES work, in the following manner.

This article (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788217/), like all the articles I have tested so far, has hyperlinks that I expected to be page anchors (i.e. "(Table 1)", "(Table 2)", etc.). It turns out that clicking those links opens a new window displaying just the table.

Example of text preceding the table in the article referenced above:

Antibacterial properties of ZEO and SNP

The inhibitory effects of ZEO and SNP alone and in combination against Staph. aureus and Salm. Typhimurium were investigated using microtiter plate assay. For Staph. aureus, the MIC values of ZEO and SNP were 1250 and 25 μg/mL and for Salm. Typhimurium the values were 2500 and 25 μg/mL, respectively. In all cases, MBC values were similar to MICs. The ZEO was found to be more effective on gram-positive than gram-negative bacteria whereas SNP displayed similar antibacterial activity on both bacteria. The MICs for SNP-ZEO combination were 0.78 and 12.5 μg/mL against Staph. aureus and Salm. Typhimurium, respectively. ZEO-SNP combination inhibited S. aureus and Salm. Typhimurium at 625 μg/mL. Based on the FICI scale (Table 2), the combination displayed a synergistic action on Staph. aureus (FICI=0.81) and Salm. Typhimurium (FICI=0.75).

EmanuelFaria commented 4 years ago

On further inspection, it seems the URLs leading to these table-only pages are built from the original article URL followed by /table/table[table#]-[article doi]/, for example:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/table/table2-2156587217717414/
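To make the pattern concrete, here is a minimal Python sketch that splits a table-only URL into its article ID and table identifier. The regular expression is an assumption generalised from the few example URLs in this thread, and `split_table_url` is a hypothetical helper name; other publishers may use identifiers this pattern does not anticipate.

```python
import re

# Assumed pattern, based only on the example URLs quoted in this thread.
TABLE_URL = re.compile(
    r"^https://www\.ncbi\.nlm\.nih\.gov/pmc/articles/"
    r"(?P<article_id>PMC\d+)/table/(?P<table_id>[^/]+)/"
)


def split_table_url(url):
    """Return (article_id, table_id), or None if the URL is not a table-only URL."""
    match = TABLE_URL.match(url)
    if match is None:
        return None
    return match.group("article_id"), match.group("table_id")


# Both identifier styles seen so far ("Tab3" and "table2-...") are accepted.
print(split_table_url(
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/table/table2-2156587217717414/"
))
print(split_table_url(
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5497343/table/Tab3/"
))
```

The table identifier is kept as one opaque string rather than split into a number and a suffix, since the "Tab3" style has no suffix at all.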

EmanuelFaria commented 4 years ago

OK.... I found one more thing that gets us to an even cleaner table....

Here are the steps I followed to what seems to be the solution:

  1. Clicking the text above Table 3 in this link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/ got me to this simpler table page: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/table/table3-2156587217717414/

  2. When I noticed the "Open in a separate window" text at the bottom, I clicked it and ended up with nothing on the page but a nice clean table, here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/table/table3-2156587217717414/?report=objectonly

So if I'm right, the URL to any cleanly extractable table should be:

https://www.ncbi.nlm.nih.gov/pmc/articles/**[ARTICLE_ID]**/table/**[TABLE#]**-**[ARTICLE_DOI]**/?report=objectonly

As a test, I changed "table3" to "table4" in this URL and got exactly what I expected, even though there was no "Open in a separate window" text on the "(Table 4)" hyperlinked page in the article:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/table/table3-2156587217717414/?report=objectonly

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871294/table/table4-2156587217717414/?report=objectonly
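The URL template and the table3-to-table4 test above could be sketched in Python roughly as follows. The function names (`table_only_url`, `candidate_table_urls`) are hypothetical, and the "tableN-suffix" naming is an assumption taken from this one article; the sketch only builds URL strings and does not check that they resolve.

```python
BASE = "https://www.ncbi.nlm.nih.gov/pmc/articles"


def table_only_url(article_id, table_id, object_only=True):
    """Build a table-only URL; object_only appends ?report=objectonly."""
    url = f"{BASE}/{article_id}/table/{table_id}/"
    if object_only:
        url += "?report=objectonly"
    return url


def candidate_table_urls(article_id, id_suffix, max_tables):
    """Candidate URLs for table1..tableN, assuming the 'tableN-suffix' naming."""
    return [
        table_only_url(article_id, f"table{n}-{id_suffix}")
        for n in range(1, max_tables + 1)
    ]


# Article ID and suffix are taken from the example URLs in this thread.
for url in candidate_table_urls("PMC5871294", "2156587217717414", 4):
    print(url)
```

Because the earlier "Tab3" example does not follow the "tableN-suffix" naming, each candidate would still need to be fetched and checked per article before relying on it.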

petermr commented 4 years ago

The tables were downloaded in XML but for some reason were not extracted. File this as an issue. We can only work with automatic downloads - not manual.

P.

On Fri, Dec 13, 2019 at 5:54 PM Emanuel Faria notifications@github.com wrote:

OK.... I found one more thing that gets us to an even cleaner table....


EmanuelFaria commented 4 years ago

I don't know if there are XML versions of pages with just the tables. If so, we could scrape our regularly downloaded XMLs for those table links, convert them to new URLs, and then download and scrape the simplified pages ...

If you could post a few sample XML page URLs, I could poke around a bit with it. (Assuming the end result would somehow make things easier/cleaner to work with).
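As a rough illustration of that idea, the sketch below pulls table identifiers out of an article's XML and turns them into table-only URLs. It rests on two assumptions that would need checking against real downloads: that the XML is JATS-style, with each table in a `table-wrap` element carrying an `id` attribute, and that this `id` is the same string used in the table-only URL. `SAMPLE_XML` is a hand-written fragment for illustration, not a real download, and `table_urls_from_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

BASE = "https://www.ncbi.nlm.nih.gov/pmc/articles"

# Hand-written illustrative fragment, NOT taken from a real downloaded file.
SAMPLE_XML = """<article>
  <body>
    <sec>
      <table-wrap id="table2-2156587217717414"><label>Table 2.</label></table-wrap>
      <table-wrap id="table3-2156587217717414"><label>Table 3.</label></table-wrap>
    </sec>
  </body>
</article>"""


def table_urls_from_xml(xml_text, article_id):
    """Return one table-only URL per <table-wrap> element that has an id."""
    root = ET.fromstring(xml_text)
    urls = []
    for wrap in root.iter("table-wrap"):
        table_id = wrap.get("id")
        if table_id:  # skip tables without an id; no URL can be built for them
            urls.append(f"{BASE}/{article_id}/table/{table_id}/?report=objectonly")
    return urls


for url in table_urls_from_xml(SAMPLE_XML, "PMC5871294"):
    print(url)
```

This stays within automatic downloads only: the XML is already fetched by the regular pipeline, and the resulting URLs could be fetched by script rather than by clicking through pages.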