petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.
The Unlicense

Scraper for Royal Society Publishing #29

Status: Open. axiomsofchoice opened this issue 4 years ago

axiomsofchoice commented 4 years ago

Implement a scraper using getpapers for Royal Society Publishing. Initially focus on Open Access papers in order to get full text.

axiomsofchoice commented 4 years ago

Initial investigations with getpapers, using the Europe PMC RESTful API:

$ getpapers -k 5 --query 'respirator' --outdir facemasks -l debug --api eupmc

Suggest extending the tool by copying and modifying eupmc.js so that the following is possible:

$ node bin\getpapers.js -k 5 --query 'respirator' --outdir sfacemasks -l debug --api royalsociety

Initially it will download full text PDFs as a proof of concept. From what I can see of the eupmc.js source, getpapers doesn't do anything further with the resources it downloads, presumably handing them over to ami (see here and here).
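For what it's worth, a minimal sketch of how such a module could fetch a results page is below. It is not modelled on the real eupmc.js exports (I haven't copied its interface); the function name and module layout are my own assumptions, and the helper just takes whatever URL it is given (the concrete URL format is worked out further down).

// royalsociety.js (sketch) - fetch a search results page as raw HTML.
// Plain Node https, no dependencies; error handling kept deliberately minimal,
// and no cookie/header handling (the site may need it, as the curl test below suggests).
const https = require('https');

function fetchSearchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

module.exports = { fetchSearchPage };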

Some inspection of the page source reveals the following is a basic search query request:

https://royalsocietypublishing.org/action/doSearch?text1=respirator&openAccess=18

More complex search queries are certainly possible.
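The query string itself is easy to assemble programmatically. A small helper, using only the two parameters visible in the URL above (text1 and openAccess=18); any further parameters exposed by the advanced search form would be passed in by the caller, since I haven't verified their names:

// Build a doSearch URL; `extra` holds any additional (unverified) parameters.
function searchUrl(query, extra = {}) {
  const params = new URLSearchParams({ text1: query, openAccess: '18', ...extra });
  return 'https://royalsocietypublishing.org/action/doSearch?' + params.toString();
}

// searchUrl('respirator') ->
// https://royalsocietypublishing.org/action/doSearch?text1=respirator&openAccess=18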

Seeing an endpoint in the JavaScript source code that would accept ajax=true, I did a little investigation to see whether either XML or JSON were available content types, without much success:

$ curl -v --cookie cookie.txt --cookie-jar cookie.txt -H 'Accept: application/xml' -H 'Content-Type: application/xml' 'https://royalsocietypublishing.org/action/doSearch?ajax=true&text1=respirator&openAccess=18' > searchResults.html

The page that is returned is clearly rendered using XSLT since results are given as:

<!-- XSLT: SearchResults.xsl, Revision: 368960, Interface: urn:atypon.com:nlm:interface:atypon-nlm-article-info, was cached: Wed Mar 18 05:21:30 PDT 2020-->
<li class="clearfix separator search__item">
   ...
</li>
<!-- END OF XSLT -->

So extracting entries and downloading full text URLs should not pose much of an issue.
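A sketch of that extraction step with cheerio, keyed off the li.search__item elements shown above. The inner selector for the article link (an anchor whose href contains /doi/) is an assumption and would need checking against the real markup:

// parse-results.js (sketch) - pull article links out of a doSearch HTML page.
const cheerio = require('cheerio');

function parseResults(html) {
  const $ = cheerio.load(html);
  const records = [];
  $('li.search__item').each((i, li) => {
    // Assumed: each result item contains an anchor pointing at a /doi/... path.
    const link = $(li).find('a[href*="/doi/"]').first();
    if (link.length) {
      records.push({
        title: link.text().trim(),
        url: new URL(link.attr('href'), 'https://royalsocietypublishing.org').href
      });
    }
  });
  return records;
}

module.exports = { parseResults };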

Finally there would be a requirement to page through results (see class="pagination").
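Rather than guessing the paging parameter, one option is simply to follow the next-page link inside the pagination block. The exact anchor selector within class="pagination" is an assumption to be checked against the markup:

// next-page.js (sketch) - find the URL of the next results page, or null.
const cheerio = require('cheerio');

function nextPageUrl(html) {
  const $ = cheerio.load(html);
  // Assumed: the pagination block marks its "next" link with a rel attribute
  // or a class; adjust the selector after inspecting the page source.
  const next = $('.pagination a[rel="next"], .pagination a.next').first();
  return next.length
    ? new URL(next.attr('href'), 'https://royalsocietypublishing.org').href
    : null;
}

module.exports = { nextPageUrl };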

Another useful link to be aware of is here.

petermr commented 4 years ago

Correct. getpapers creates a directory with PMC* child directories, each containing a fulltext.pdf. The API is switchable, so you could have a --api rs. Note that the default is --api eupmc.
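As a generic illustration of the idea (not getpapers' actual dispatch code, which is not reproduced here), API switching can be as simple as a lookup table from the --api value to a module:

// api-dispatch.js (sketch) - map the --api value onto an API module.
// './eupmc' exists in getpapers; './royalsociety' is the proposed new module.
const apis = {
  eupmc: require('./eupmc'),
  rs: require('./royalsociety')
};

function chooseApi(name = 'eupmc') {  // eupmc as the default, per the comment above
  const api = apis[name];
  if (!api) throw new Error('unknown --api value: ' + name);
  return api;
}

module.exports = { chooseApi };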

petermr commented 4 years ago

Test quickscrape (https://github.com/ContentMine/quickscrape) - I haven't used it for some years. It should be possible to build a RoyalSociety scraper.

P.

axiomsofchoice commented 4 years ago

Since the Royal Society results aren't lazy-loaded, I think for now getpapers is the better framework to start from. I've just pushed some initial code that downloads search results here. There are still some unfixed bugs (due to its Europe PMC heritage) and missing features (pagination and full text download).
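For the missing full text download, a minimal sketch of the kind of helper that could fill the gap: fetch a PDF URL and write it to a per-paper directory as fulltext.pdf, mirroring the layout getpapers uses. The helper name is my own, redirect handling is deliberately simple, and https is assumed throughout:

// download-pdf.js (sketch) - save a full text PDF as <dir>/fulltext.pdf.
const https = require('https');
const fs = require('fs');
const path = require('path');

function downloadPdf(url, dir, redirects = 5) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      // Follow redirects (DOI and publisher links usually redirect at least once).
      if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
        res.resume();
        if (redirects === 0) return reject(new Error('too many redirects'));
        return resolve(downloadPdf(res.headers.location, dir, redirects - 1));
      }
      fs.mkdirSync(dir, { recursive: true });
      const file = path.join(dir, 'fulltext.pdf');
      res.pipe(fs.createWriteStream(file))
        .on('finish', () => resolve(file))
        .on('error', reject);
    }).on('error', reject);
  });
}

module.exports = { downloadPdf };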

petermr commented 4 years ago

quickscrape on RS

Found a paper on COVID-19 (https://doi.org/10.1098/rsos.191420) and ran quickscrape with the scraper directory:

quickscrape --url https://doi.org/10.1098/rsos.191420 --scraperdir ../../journal-scrapers/ --output rs-191420 --outformat bibjson
info: quickscrape 0.4.7 launched with...
info: - URL: https://doi.org/10.1098/rsos.191420
info: - Scraperdir: /Users/pm286/projects/journal-scrapers
info: - Rate limit: 3 per minute
info: - Log level: info
error: the scraper directory provided did not contain any valid scrapers

Guessing that rs.json is an RS scraper...

$ quickscrape --url https://doi.org/10.1098/rsos.191420 --scraper ../../journal-scrapers/scrapers/rs.json --output rs-191420 --outformat bibjson
info: quickscrape 0.4.7 launched with...
info: - URL: https://doi.org/10.1098/rsos.191420
info: - Scraper: /Users/pm286/projects/journal-scrapers/scrapers/rs.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://doi.org/10.1098/rsos.191420
info: [scraper]. URL rendered. https://doi.org/10.1098/rsos.191420.
info: URL processed: captured 0/10 elements (10 captures failed)
info: all tasks completed

pm286macbook:openVirus pm286$ ls
INSTALLING.md       biorxiv_medrxiv     searches
README.md       diary           socdist
Wishlist.md     dictionaries        textIndexing
assets          rs-191420       wikiPackageTesting.R

pm286macbook:openVirus pm286$ tree rs-191420/
rs-191420/
└── https_doi.org_10.1098_rsos.191420
    ├── bib.json
    └── results.json

1 directory, 2 files
pm286macbook:openVirus pm286$ more rs-191420/https_doi.org_10.1098_rsos.191420/bib.json 
{
  "link": [],
  "journal": {},
  "sections": {},
  "date": {},
  "identifier": [],
  "log": [
    {
      "date": "2020-03-29T22:12:47+01:00",
      "event": "scraped by quickscrape v0.4.7"
    }
  ]
}
pm286macbook:openVirus pm286$ more rs-191420/https_doi.org_10.1098_rsos.191420/results.json 
{
  "fulltext_pdf": {
    "value": []
  },
  "fulltext_html": {
    "value": []
  },
  "title": {
    "value": []
  },
  "author": {
    "value": []
  },
  "date": {
    "value": []
  },
  "doi": {
    "value": []
  },
  "volume": {
    "value": []
  },
  "issue": {
    "value": []
  },
  "firstpage": {
    "value": []
  },
  "description": {
    "value": []
  }
}

This hasn't worked; I will revisit the scraper. (NOTE: I tried the example in the quickscrape docs and it worked.)

$ quickscrape --url https://peerj.com/articles/384 --scraper ../../journal-scrapers/scrapers/peerj.json --output peerj-384
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /Users/pm286/projects/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://peerj.com/articles/384
info: [scraper]. URL rendered. https://peerj.com/articles/384.
info: [scraper]. download started. fig-1-full.png.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.html.
info: URL processed: captured 27/34 elements (7 captures failed)
info: all tasks completed
pm286macbook:openVirus pm286$ tree peerj-384/ 
peerj-384/
└── https_peerj.com_articles_384
    ├── fig-1-full.png
    ├── fulltext.html
    ├── fulltext.pdf
    └── results.json

1 directory, 4 files
pm286macbook:openVirus pm286$ 
petermr commented 4 years ago

The RS site has changed and I got zero captures. I will revisit in the morning.
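For that revisit, one possible starting point is a fresh rs.json in the scraperJSON format that journal-scrapers uses, keeping the ten element names that show up (empty) in results.json above. The selectors below assume the article pages expose the usual citation_* meta tags; whether royalsocietypublishing.org actually does, and whether the url regex still matches after the redirect from doi.org, both need checking:

{
  "url": "royalsocietypublishing\\.org",
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "fulltext_html": {
      "selector": "//meta[@name='citation_fulltext_html_url']",
      "attribute": "content"
    },
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "author": {
      "selector": "//meta[@name='citation_author']",
      "attribute": "content"
    },
    "date": {
      "selector": "//meta[@name='citation_publication_date']",
      "attribute": "content"
    },
    "doi": {
      "selector": "//meta[@name='citation_doi']",
      "attribute": "content"
    },
    "volume": {
      "selector": "//meta[@name='citation_volume']",
      "attribute": "content"
    },
    "issue": {
      "selector": "//meta[@name='citation_issue']",
      "attribute": "content"
    },
    "firstpage": {
      "selector": "//meta[@name='citation_firstpage']",
      "attribute": "content"
    },
    "description": {
      "selector": "//meta[@name='dc.Description']",
      "attribute": "content"
    }
  }
}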