Open axiomsofchoice opened 4 years ago
Initial investigations with getpapers, using the Europe PMC RESTful API:
$ getpapers -k 5 --query 'respirator' --outdir facemasks -l debug --api eupmc
Suggest extending the tool by copying and modifying eupmc.js so that the following is possible:
$ node bin\getpapers.js -k 5 --query 'respirator' --outdir sfacemasks -l debug --api royalsociety
Initially it will download full text PDFs as proof of concept. From what I can see of the eupmc.js source, getpapers doesn't do anything with the resources it downloads, presumably to hand this over to ami (see here and here).
Some inspection of the page source reveals the following is a basic search query request:
https://royalsocietypublishing.org/action/doSearch?text1=respirator&openAccess=18
More complex search queries are certainly possible.
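A minimal sketch of building that request in Node, for anyone picking this up. Only `text1` and `openAccess=18` come from the request observed above; the function name `buildSearchUrl` and the `extraParams` argument are illustrative, and any further parameter passed through it is an untested assumption.

```javascript
// Sketch: build a Royal Society search URL from the parameters seen in the page source.
function buildSearchUrl(term, extraParams = {}) {
  const url = new URL("https://royalsocietypublishing.org/action/doSearch");
  url.searchParams.set("text1", term);
  // Value copied verbatim from the observed request; its meaning is not documented here.
  url.searchParams.set("openAccess", "18");
  // Hypothetical hook for extra query parameters (e.g. ajax=true from the curl test).
  for (const [key, value] of Object.entries(extraParams)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

console.log(buildSearchUrl("respirator"));
// https://royalsocietypublishing.org/action/doSearch?text1=respirator&openAccess=18
```

Using `URL` and `URLSearchParams` rather than string concatenation means search terms with spaces or punctuation get encoded correctly.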
Seeing an endpoint in the JavaScript source code that would accept ajax=true, I did a little investigation to see if either XML or JSON were available content types, without much success:
$ curl -v --cookie cookie.txt --cookie-jar cookie.txt -H 'Accept: application/xml' -H 'Content-Type: application/xml' 'https://royalsocietypublishing.org/action/doSearch?ajax=true&text1=respirator&openAccess=18' > searchResults.html
The page that is returned is clearly rendered using XSLT since results are given as:
<!-- XSLT: SearchResults.xsl, Revision: 368960, Interface: urn:atypon.com:nlm:interface:atypon-nlm-article-info, was cached: Wed Mar 18 05:21:30 PDT 2020-->
<li class="clearfix separator search__item">
...
</li>
<!-- END OF XSLT -->
So extracting entries and downloading full text URLs should not pose much of an issue.
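As a rough proof of concept, the entries can be pulled out with a regular expression keyed on the `search__item` class shown above. This is only a sketch: the contents of each `<li>` are elided in the snippet above, so the sample HTML below is invented, and `extractSearchItems` is an illustrative name rather than anything in getpapers.

```javascript
// Sketch: return the inner HTML of each <li class="... search__item ..."> entry.
// A regex is enough for a first pass, but it will not cope with nested <li>
// elements, so a real HTML parser would be the safer long-term choice.
function extractSearchItems(html) {
  const itemPattern =
    /<li\b[^>]*class="[^"]*\bsearch__item\b[^"]*"[^>]*>([\s\S]*?)<\/li>/g;
  const items = [];
  for (const match of html.matchAll(itemPattern)) {
    items.push(match[1].trim());
  }
  return items;
}

// Hypothetical sample shaped like the observed markup; the inner text is made up.
const sampleResultsHtml = `
<li class="clearfix separator search__item">first result</li>
<li class="clearfix separator search__item">second result</li>
<li class="unrelated">not a result</li>`;

console.log(extractSearchItems(sampleResultsHtml).length); // 2
```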
Finally there would be a requirement to page through results (see class="pagination").
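One way the paging could be structured, independent of how the site actually encodes page numbers (which I have not worked out): keep requesting pages until one comes back empty. `collectAllResults` and the `fetchPage` callback are hypothetical names; mapping a page index to a real URL is left to the callback.

```javascript
// Sketch: accumulate entries page by page until an empty page is returned.
// fetchPage(pageIndex) is a hypothetical callback that resolves to an array of
// entries for that page. maxPages is a safety cap against endless loops.
async function collectAllResults(fetchPage, maxPages = 100) {
  const all = [];
  for (let page = 0; page < maxPages; page++) {
    const entries = await fetchPage(page);
    if (entries.length === 0) break; // no more results
    all.push(...entries);
  }
  return all;
}

// Usage with a stand-in fetcher serving three fixed pages instead of the network.
const fakePages = [["a", "b"], ["c"], []];
collectAllResults(async (page) => fakePages[page] || []).then((results) => {
  console.log(results); // [ 'a', 'b', 'c' ]
});
```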
Another useful link to be aware of is here.
Correct. getpapers creates a directory with PMC* child directories, each with a fulltext.pdf. The api is switchable so you could have a --api rs. Note that the default is --api eupmc.
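For reference, a small sketch that walks that output layout and reports which PMC* directories actually received a fulltext.pdf. It assumes only the layout described above; `listFulltextPdfs` is an illustrative helper, not part of getpapers.

```javascript
const fs = require("fs");
const path = require("path");

// Sketch: list the PMC* child directories of a getpapers output directory and
// note whether each one contains a fulltext.pdf.
function listFulltextPdfs(outdir) {
  return fs
    .readdirSync(outdir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && entry.name.startsWith("PMC"))
    .map((entry) => ({
      id: entry.name,
      hasPdf: fs.existsSync(path.join(outdir, entry.name, "fulltext.pdf")),
    }));
}

// "facemasks" is the --outdir from the first command in this thread.
if (fs.existsSync("facemasks")) {
  console.log(listFulltextPdfs("facemasks"));
}
```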
Test quickscrape; I haven't used it for some years. https://github.com/ContentMine/quickscrape It should be possible to build a Royal Society scraper.
P.
Since the Royal Society results aren't lazy-loaded, I think for now getpapers is the better framework to start from. I've just pushed some initial code that downloads search results here. There are still some unfixed bugs (due to its Europe PMC heritage) and features missing (pagination and full text download).
Found a paper on COVID-19 https://doi.org/10.1098/rsos.191420
Run with the scraper directory:
quickscrape --url https://doi.org/10.1098/rsos.191420 --scraperdir ../../journal-scrapers/ --output rs-191420 --outformat bibjson
info: quickscrape 0.4.7 launched with...
info: - URL: https://doi.org/10.1098/rsos.191420
info: - Scraperdir: /Users/pm286/projects/journal-scrapers
info: - Rate limit: 3 per minute
info: - Log level: info
error: the scraper directory provided did not contain any valid scrapers
Guessing that rs.json is a RS scraper...
$ quickscrape --url https://doi.org/10.1098/rsos.191420 --scraper ../../journal-scrapers/scrapers/rs.json --output rs-191420 --outformat bibjson
info: quickscrape 0.4.7 launched with...
info: - URL: https://doi.org/10.1098/rsos.191420
info: - Scraper: /Users/pm286/projects/journal-scrapers/scrapers/rs.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://doi.org/10.1098/rsos.191420
info: [scraper]. URL rendered. https://doi.org/10.1098/rsos.191420.
info: URL processed: captured 0/10 elements (10 captures failed)
info: all tasks completed
pm286macbook:openVirus pm286$ ls
INSTALLING.md biorxiv_medrxiv searches
README.md diary socdist
Wishlist.md dictionaries textIndexing
assets rs-191420 wikiPackageTesting.R
pm286macbook:openVirus pm286$ tree rs-191420/
rs-191420/
└── https_doi.org_10.1098_rsos.191420
    ├── bib.json
    └── results.json
1 directory, 2 files
pm286macbook:openVirus pm286$ more rs-191420/https_doi.org_10.1098_rsos.191420/bib.json
{
  "link": [],
  "journal": {},
  "sections": {},
  "date": {},
  "identifier": [],
  "log": [
    {
      "date": "2020-03-29T22:12:47+01:00",
      "event": "scraped by quickscrape v0.4.7"
    }
  ]
}
pm286macbook:openVirus pm286$ more rs-191420/https_doi.org_10.1098_rsos.191420/results.json
{
  "fulltext_pdf": {
    "value": []
  },
  "fulltext_html": {
    "value": []
  },
  "title": {
    "value": []
  },
  "author": {
    "value": []
  },
  "date": {
    "value": []
  },
  "doi": {
    "value": []
  },
  "volume": {
    "value": []
  },
  "issue": {
    "value": []
  },
  "firstpage": {
    "value": []
  },
  "description": {
    "value": []
  }
}
This hasn't worked; I will revisit the scraper. (Note: I tried the example in the quickscrape docs and it worked.)
quickscrape --url https://peerj.com/articles/384 --scraper ../../journal-scrapers/scrapers/peerj.json --output peerj-384
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /Users/pm286/projects/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://peerj.com/articles/384
info: [scraper]. URL rendered. https://peerj.com/articles/384.
info: [scraper]. download started. fig-1-full.png.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fulltext.html.
info: URL processed: captured 27/34 elements (7 captures failed)
info: all tasks completed
pm286macbook:openVirus pm286$ tree peerj-384/
peerj-384/
└── https_peerj.com_articles_384
    ├── fig-1-full.png
    ├── fulltext.html
    ├── fulltext.pdf
    └── results.json
1 directory, 4 files
pm286macbook:openVirus pm286$
The RS site had changed, so I had zero captures. Will revisit in the morning.
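To spot a zero-capture run like the RS one without opening results.json by hand, something along these lines could be used. It relies only on the `{ "name": { "value": [...] } }` shape visible in the results.json output above; `countCaptures` is an illustrative helper, not part of quickscrape, and the sample object is abridged from that output.

```javascript
// Sketch: summarise which elements of a quickscrape results.json captured anything.
function countCaptures(results) {
  const names = Object.keys(results);
  // An element counts as captured when its "value" array is non-empty.
  const captured = names.filter((name) => {
    const value = results[name] && results[name].value;
    return Array.isArray(value) && value.length > 0;
  });
  return {
    captured: captured.length,
    total: names.length,
    empty: names.filter((name) => !captured.includes(name)),
  };
}

// Abridged from the RS results.json shown earlier: every value array is empty.
const rsResults = {
  fulltext_pdf: { value: [] },
  title: { value: [] },
  doi: { value: [] },
};

const summary = countCaptures(rsResults);
console.log(`${summary.captured}/${summary.total} captured`); // 0/3 captured
```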
Implement a scraper using getpapers for Royal Society Publishing. Initially focus on Open Access papers in order to get full text.