ropensci-archive / fulltext

:warning: ARCHIVED :warning: Search across and get full text for OA & closed journals

Can I improve ft_get retrieval? #191

Closed LMAllenJacobson closed 4 years ago

LMAllenJacobson commented 5 years ago
Session Info ```r Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS Mojave 10.14.3 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2019-03-06 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source aRxiv 0.5.16 2017-04-28 [1] CRAN (R 3.5.0) assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) bib2df * 1.0.1 2018-06-02 [1] CRAN (R 3.5.2) bibliometrix * 2.1.1 2019-02-07 [1] CRAN (R 3.5.2) BibScan * 0.1.0 2019-02-26 [1] Github (Science-for-Nature-and-People/BibScan@361c567) bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.0) broom 0.5.1 2018-12-05 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) cluster 2.0.7-1 2018-04-13 [2] CRAN (R 3.5.2) colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) crminer * 0.2.0 2018-10-15 [1] CRAN (R 3.5.0) crul 0.7.0 2019-01-04 [1] CRAN (R 3.5.2) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools * 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2) DT 0.5 2018-11-05 [1] CRAN (R 3.5.0) factoextra 1.0.5 2017-08-22 [1] CRAN (R 3.5.0) FactoMineR 1.41 2018-05-04 [1] CRAN (R 3.5.0) fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.0) farver 1.1.0 2018-11-20 [1] CRAN (R 3.5.0) flashClust 1.01-2 2012-08-21 [1] CRAN (R 3.5.0) forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.5.2) fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) fulltext * 1.2.0 2019-01-22 [1] CRAN (R 3.5.2) generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.0) ggforce 0.1.3 2018-07-07 [1] 
CRAN (R 3.5.0) ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.0) ggraph 1.0.2 2018-07-07 [1] CRAN (R 3.5.0) ggrepel 0.8.0 2018-05-09 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) gridExtra 2.3 2017-09-09 [1] CRAN (R 3.5.0) gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0) haven 2.1.0 2019-02-19 [1] CRAN (R 3.5.2) hms 0.4.2 2018-03-10 [1] CRAN (R 3.5.0) hoardr 0.5.2 2018-12-02 [1] CRAN (R 3.5.0) htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0) htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.0) httpcode 0.2.0 2016-11-14 [1] CRAN (R 3.5.0) httpuv 1.4.5.1 2018-12-18 [1] CRAN (R 3.5.0) httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.0) humaniformat 0.6.0 2016-04-24 [1] CRAN (R 3.5.0) igraph 1.2.4 2019-02-13 [1] CRAN (R 3.5.2) jsonlite * 1.6 2018-12-07 [1] CRAN (R 3.5.0) later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2) lattice 0.20-38 2018-11-04 [2] CRAN (R 3.5.2) lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0) leaps 3.0 2017-01-10 [1] CRAN (R 3.5.0) listviewer * 2.1.0 2018-10-07 [1] CRAN (R 3.5.0) lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0) magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0) MASS 7.3-51.1 2018-11-01 [2] CRAN (R 3.5.2) Matrix 1.2-15 2018-11-01 [2] CRAN (R 3.5.2) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) microdemic 0.4.0 2018-10-25 [1] CRAN (R 3.5.0) mime 0.6 2018-10-05 [1] CRAN (R 3.5.0) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.5.0) modelr 0.1.4 2019-02-18 [1] CRAN (R 3.5.2) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0) networkD3 0.4 2017-03-18 [1] CRAN (R 3.5.0) nlme 3.1-137 2018-04-07 [2] CRAN (R 3.5.2) pdftools 2.1 2019-01-16 [1] CRAN (R 3.5.2) pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) plyr * 1.8.4 2016-06-08 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.0) promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) purrr * 0.3.0 2019-01-27 [1] CRAN (R 3.5.2) R6 
2.4.0 2019-02-14 [1] CRAN (R 3.5.2) rappdirs 0.3.1 2016-03-28 [1] CRAN (R 3.5.0) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.5.0) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) rcrossref 0.9.0 2019-01-14 [1] CRAN (R 3.5.2) readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.0) readxl 1.3.0 2019-02-15 [1] CRAN (R 3.5.2) RefManageR * 1.2.0 2018-04-25 [1] CRAN (R 3.5.0) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rentrez 1.2.1 2018-03-05 [1] CRAN (R 3.5.0) reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.0) RISmed 2.1.7 2017-06-06 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rplos 0.8.4 2018-08-14 [1] CRAN (R 3.5.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) rscopus 0.6.3 2018-11-19 [1] CRAN (R 3.5.0) rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2) rvest * 0.3.2 2016-06-17 [1] CRAN (R 3.5.0) scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0) scatterplot3d 0.3-41 2018-03-14 [1] CRAN (R 3.5.0) selectr 0.4-1 2018-04-06 [1] CRAN (R 3.5.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.0) shinycssloaders 0.2.0 2017-05-12 [1] CRAN (R 3.5.0) shinythemes 1.1.2 2018-11-06 [1] CRAN (R 3.5.0) SnowballC 0.6.0 2019-01-15 [1] CRAN (R 3.5.2) solrium 1.0.2 2018-12-13 [1] CRAN (R 3.5.0) storr 1.2.1 2018-10-18 [1] CRAN (R 3.5.0) stringdist 0.9.5.1 2018-06-08 [1] CRAN (R 3.5.0) stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2) stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.2) tibble * 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) tidyr * 0.8.2 2018-10-28 [1] CRAN (R 3.5.0) tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0) tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.5.0) triebeard 0.3.0 2016-08-04 [1] CRAN (R 3.5.0) tweenr 1.0.1 2018-12-14 [1] CRAN (R 3.5.0) units 0.6-2 2018-12-05 [1] CRAN (R 3.5.0) urltools 1.7.2 2019-02-04 [1] CRAN (R 3.5.2) usethis * 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.0) viridis 0.5.1 2018-03-29 [1] CRAN (R 3.5.0) viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.0) whisker 0.3-2 2013-04-28 [1] CRAN (R 3.5.0) withr 2.1.2 2018-03-15 
[1] CRAN (R 3.5.0) XML * 3.98-1.17 2019-02-08 [1] CRAN (R 3.5.2) xml2 * 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.0) yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0) [1] /Users/lallenjacobson-mac/Library/R/3.5/library [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library ```
library("bib2df")
library("data.table")
library("fulltext")
library("purrr")

BuildBib

Goal: build a PDF library to automate the initial steps of meta-analysis. This will make meta-analysis more reproducible, more efficient, and hopefully more comprehensive. I've started by attempting to reproduce a published meta-analysis.

Problem: At 52%, ft_get() retrieves more PDFs than other automated options (e.g., Zotero, BibScan). However, I am still failing to retrieve articles that I have access to through my institution (University of Florida). JSTOR, Oxford University Press, University of Chicago Press, and Wiley are the major sources of loss.

I start by copying all references in the publication into a .txt file, with each reference on its own numbered line. Then I retrieve the DOIs for each reference using the following Crossref service: https://www.crossref.org/stqUpload/ Crossref returns (via email) an .html file with DOIs. I use the following citation finder to convert the .html to .bib: http://git.macropus.org/citation-finder/

I read the .bib file in as a data.table:

bib_df <- as.data.table(bib2df(file = "glazier2005DOI.bib"))
DOI <- bib_df[,DOI]

Here is a list of the DOIs in my example:

Create vector of DOIs ```{r DOI data} DOI <- c("10.1016/0300-9629(80)90045-6", "10.1111/j.1365-2427.1982.tb00620.x", "10.2307/3543214", "10.1016/0022-0981(73)90030-0", "10.2307/1539269", "10.2307/1563678", "10.1016/0022-0981(78)90100-4", "10.1139/z64-016", "10.1139/z64-015", "10.1007/s003600050170", "10.1086/322965", "10.1007/s003600050220", "10.1093/jn/5.6.581", "10.1007/bf00142190", "10.2307/4444807", "10.1086/physzool.63.3.30156228", "10.4319/lo.1977.22.1.0108", "10.1016/s1095-6433(00)00351-2", "10.1002/jez.1402500215", "10.1086/physzool.45.1.30155926", "10.2307/1937827", "10.1016/0010-406x(62)90100-7", "10.1139/f73-068", "10.1016/0301-6226(82)90044-6", "10.1007/bf00009782", "10.1093/jn/121.suppl_11.s18", "10.1086/515917", "10.1111/j.1095-8649.1978.tb03426.x", "10.2307/2389690", "10.1016/0300-9629(94)90339-5", "10.1016/0300-9629(71)90276-3", "10.1086/physzool.55.2.30155850", "10.1007/bf00366299", "10.1016/0300-9629(82)90072-x", "10.1016/0010-406x(66)90199-x", "10.1016/0300-9629(82)90071-8", "10.1086/physzool.66.1.30158293", "10.1016/s0022-5193(86)80068-6", "10.1016/0010-406x(67)90255-1", "10.1016/0010-406x(69)90057-7", "10.1016/0300-9629(80)90365-5", "10.1111/j.1095-8649.1985.tb04017.x", "10.1086/physzool.37.4.30152756", "10.1111/j.1744-7348.1958.tb02226.x", "10.1007/bf00344853", "10.2307/1935183", "10.2307/1444056", "10.3354/meps243217", "10.1016/0300-9629(75)90059-6", "10.1079/bjn19720046", "10.1152/ajpregu.1984.247.5.r806", "10.1152/ajpregu.1987.252.3.r439", "10.3354/meps050013", "10.4319/lo.1991.36.2.0354", "10.1007/bf01875448", "10.1111/j.1365-3032.1997.tb01176.x", "10.1007/bf00005604", "10.1007/bf00379996", "10.1111/j.1095-8649.2000.tb00272.x", "10.1016/0034-5687(82)90046-9", "10.1093/jn/121.suppl_11.s8", "10.1016/0300-9629(81)92991-1", "10.1139/z79-277", "10.1242/jeb.00394", "10.1046/j.1095-8649.2003.00048.x", "10.1007/bf00344887", "10.2307/3543971", "10.1093/jn/3.2.177", "10.1093/jn/8.2.139", "10.1086/physzool.39.1.30152763", 
"10.1016/0300-9629(73)90258-2", "10.1098/rstb.1986.0023", "10.2307/2976", "10.1016/0022-0981(71)90016-5", "10.1016/0300-9629(77)90468-6", "10.1007/bf00684448", "10.1139/f71-253", "10.2307/1350546", "10.1007/bf00297958", "10.1126/science.134.3495.2033", "10.1007/bf00346410", "10.1163/187529274x00591", "10.1007/bf00386903", "10.1007/bf00592305", "10.1016/0300-9629(87)90430-0", "10.1139/f72-270", "10.1093/jn/18.5.473", "10.1016/0300-9629(73)90019-4", "10.1007/bf00346295", "10.1016/s1095-6433(03)00145-4", "10.2307/1942479", "10.1016/0022-0981(84)90109-6", "10.1086/316715", "10.1086/639616", "10.1016/0306-4565(77)90013-4", "10.1126/science.66.1709.289", "10.1007/bf00338583", "10.1098/rspb.2003.2347", "10.1016/0300-9629(87)90323-9", "10.1086/physzool.46.4.30155609", "10.2307/1933575", "10.1111/j.1095-8649.1991.tb03136.x", "10.4319/lo.1971.16.1.0086", "10.1093/icb/9.2.418", "10.2307/3544118", "10.1139/f95-278", "10.1016/0010-406x(69)91334-6", "10.1016/0010-406x(62)90031-2", "10.2307/1936532", "10.1007/bf00345740", "10.2307/2390277", "10.1086/physzool.52.1.30159931", "10.1016/0300-9629(80)90184-x", "10.2307/1540305", "10.4319/lo.1978.23.3.0461", "10.1016/0300-9629(81)90644-7", "10.1086/physzool.32.1.30152287", "10.1016/0300-9629(95)00055-c", "10.2307/1446195", "10.2307/1446195", "10.1086/639605", "10.2527/jas1976.433692x", "10.1016/0034-5687(73)90045-5", "10.1016/s0022-1910(99)00036-0", "10.2307/1542703", "10.1086/physzool.17.1.30151829", "10.2307/1933448", "10.1007/bf00738417", "10.1086/physzool.41.4.30155477", "10.1111/j.1095-8649.2004.00374.x", "10.1086/physzool.63.6.30152639", "10.3354/meps253233", "10.1016/0022-1910(79)90025-8", "10.1016/s1095-6433(02)00344-6") ```

Add keys to the R environment and retrieve full-text PDFs:

SPRINGER_KEY <- "mykey"
crossref_email <- "myemail"
CROSSREF_TDM <- "mykey"
ELSEVIER_SCOPUS_KEY <- "mykey"
ENTREZ_KEY <- "mykey"
MICROSOFT_ACADEMIC_KEY <- "mykey"

bib_ft <- ft_get(DOI, progress = TRUE)

Build a summary table:

retrieved <- stack((map(bib_ft, 1)))
dois_queried <- map(bib_ft, 2)
queried <- stack(lapply(dois_queried, function(x) length(x)))
summary <- merge(x = retrieved, y = queried, all=TRUE, by = "ind")
setnames(summary, old=c("values.x","values.y"), new=c("retrieved", "queried"))
| source | retrieved | queried |
|---|---|---|
| aaas | 0 | 2 |
| american_physiological_society | 2 | 2 |
| brill | 0 | 1 |
| cambridge_university_press_cup | 0 | 1 |
| canadian_science_publishing | 4 | 7 |
| elsevier | 36 | 36 |
| inter_research_science_center | 3 | 3 |
| jstor | 0 | 10 |
| oxford_university_press_oup | 0 | 8 |
| springer_nature | 20 | 20 |
| the_company_of_biologists | 0 | 1 |
| the_royal_society | 0 | 2 |
| university_of_california_press | 0 | 1 |
| university_of_chicago_press | 5 | 20 |
| wiley | 0 | 20 |

sckott commented 5 years ago

thanks very much for this @LMAllenJacobson

First, as long as you are already writing code to do this work, you can get DOIs from citations using either fulltext or rcrossref, e.g.,

# the title for the first DOI, 10.1016/0300-9629(80)90045-6
txt = "Effect of body size and temperature on oxygen uptake in the water snakes Helicops modestus and Liophis miliaris (colubridae)"
res = ft_search(txt, from = "crossref")
# it does come back with many results, but the top match is the article in question
res$crossref$data$title[1]
#> [1] "Effect of body size and temperature on oxygen uptake in the water snakes Helicops modestus and Liophis miliaris (colubridae)"
res$crossref$data$doi[1]
#> [1] "10.1016/0300-9629(80)90045-6"
sckott commented 5 years ago

one thing that may help is to summarize errors, e.g.,

bib_ft <- ft_get(DOI, progress = TRUE)
library(dplyr)
bind_rows(lapply(bib_ft, "[[", "errors"), .id = "publisher") %>% 
  filter(!is.na(error))

this gives a data.frame of all the errors across publishers, filtering out records that had no errors (i.e., the ones you got the full text for)

interpreting errors is a bit tricky. for example an error like

type was supposed to be pdf, but was text/html; charset=UTF-8

is not very transparent. essentially, the link that was tried was for a pdf, but the content type that was returned (html) was not a pdf type. in my experience this indicates that the publisher likely served a "you don't have access" page. It's hard to say this definitively and programmatically in this package though
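
One way to make these messages easier to scan is to bucket them by pattern. This is only a minimal sketch in base R, assuming a combined errors data frame with `publisher` and `error` columns like the one built with `bind_rows()` above. The `classify_error()` helper, its category labels, and the sample rows are hypothetical, and the "probable access denial" label is just the heuristic described here, not something the package reports.

```r
# Hypothetical helper: map an ft_get() error message to a coarse category.
classify_error <- function(msg) {
  ifelse(is.na(msg), "no error",
  ifelse(grepl("type was supposed to be pdf", msg, fixed = TRUE),
         "probable access denial",
  ifelse(grepl("no link found from Crossref", msg, fixed = TRUE),
         "no link in Crossref metadata",
         "other")))
}

# Made-up rows standing in for the combined errors data frame.
errors <- data.frame(
  publisher = c("wiley", "jstor", "elsevier", "aaas"),
  error = c("type was supposed to be pdf, but was text/html; charset=UTF-8",
            "no link found from Crossref",
            NA,
            "some other message"),
  stringsAsFactors = FALSE
)

# Attach the category and count how many errors fall into each bucket.
errors$category <- classify_error(errors$error)
table(errors$category)
```

Counting per category and per publisher (for example with `table(errors$publisher, errors$category)`) would show which publishers mostly fail on missing links and which mostly fail on what looks like access.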


I did the same requests for these DOIs, and got about the same results. Some of them I think there's no way to fix; that is, we just don't have access through any of the ways provided in this pkg. Of course there are other ways that I won't mention here, and that aren't included in this package.

Oxford was particularly sluggish, and I often ctrl+C out of those requests, proceeding to the next.

The Royal Society looks promising though. E.g., the DOI 10.1098/rstb.1986.0023 goes to https://royalsocietypublishing.org/doi/abs/10.1098/rstb.1986.0023 - and at least I have access to that PDF. Looks like the link that Crossref gives is bad, and indeed, the error message from this pkg was no link found from Crossref. So perhaps I can make some logic for this specific publisher to get the right link

JSTOR is another interesting one: the DOI 10.2307/3543214 goes to https://www.jstor.org/stable/3543214 , the error recorded is no link found from Crossref, and indeed Crossref doesn't have a link for that => http://api.crossref.org/works/10.2307/3543214 - When you click on download pdf on that JSTOR page, they throw a pop up window, probably making programmatic access especially nasty. I don't think there's any way to solve this, sorry. You might try https://github.com/ropensci/jstor for JSTOR articles

Most of Univ. of Chicago press articles could not be downloaded, not much can be done about that that I know of.

For Wiley, almost all of those worked for me. Are you sure you had your Crossref TDM key set up right? Elsevier did work for you though, so that would suggest Wiley would work too.
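
One thing that may be worth checking, as a minimal sketch: this assumes the package looks up keys as environment variables (via `Sys.getenv()`) rather than as objects in the R workspace. If that assumption holds, an assignment such as `CROSSREF_TDM <- "mykey"` only creates an ordinary R variable that the package never sees. The key names below are copied from the original post, and "mykey" is a placeholder.

```r
# Set the key as an environment variable for this session only.
Sys.setenv(CROSSREF_TDM = "mykey")

# Report which of the expected variables are set (non-empty).
keys <- c("CROSSREF_TDM", "SPRINGER_KEY", "ENTREZ_KEY")
key_is_set <- nzchar(Sys.getenv(keys))
names(key_is_set) <- keys
key_is_set
```

To make a key persist across sessions, the same `NAME=value` pair can go in `~/.Renviron`, which R reads at startup.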

sckott commented 5 years ago

The Cambridge scraper is going to require access https://github.com/ropensci/fulltext/issues/195 - it has to scrape the link from the page once access is granted, then get the pdf (the pdf url is a mess, and can't be sorted out just from the DOI, etc.)

LMAllenJacobson commented 5 years ago

Thanks for your detailed feedback. I haven't had a chance to incorporate your suggestions because I've been trying to fix a couple of different problems:

First, apparently I've done something that is interfering with ft_get(). I am logged on through the same VPN and am using the same API keys and the same code. I had updated two packages (purrr and XML). I've downgraded purrr, but cannot seem to downgrade XML to the original version. Here are the updated session details:

Session Info for original computer ```r ─ Session info ────────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS Mojave 10.14.3 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2019-03-11 ─ Packages ────────────────────────────────────────────────────────────────────── ! package * version date lib source aRxiv 0.5.16 2017-04-28 [1] CRAN (R 3.5.0) assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) bib2df * 1.0.1 2018-06-02 [1] CRAN (R 3.5.2) bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) crul 0.7.0 2019-01-04 [1] CRAN (R 3.5.2) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) dplyr 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2) DT 0.5 2018-11-05 [1] CRAN (R 3.5.0) fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) fulltext * 1.2.0 2019-01-22 [1] CRAN (R 3.5.2) ggplot2 3.1.0 2018-10-25 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0) hoardr 0.5.2 2018-12-02 [1] CRAN (R 3.5.0) htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0) htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.0) httpcode 0.2.0 2016-11-14 [1] CRAN (R 3.5.0) httpuv 1.4.5.1 2018-12-18 [1] CRAN (R 3.5.0) httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.0) humaniformat 0.6.0 2016-04-24 [1] CRAN (R 3.5.0) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0) later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2) lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0) lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0) magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) microdemic 0.4.0 2018-10-25 
[1] CRAN (R 3.5.0) mime 0.6 2018-10-05 [1] CRAN (R 3.5.0) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.5.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0) pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.0) promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) V purrr * 0.3.1 2019-01-27 [1] CRAN (R 3.5.2) R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2) rappdirs 0.3.1 2016-03-28 [1] CRAN (R 3.5.0) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) rcrossref 0.9.0 2019-01-14 [1] CRAN (R 3.5.2) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rentrez 1.2.1 2018-03-05 [1] CRAN (R 3.5.0) reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rplos 0.8.4 2018-08-14 [1] CRAN (R 3.5.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2) scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.0) solrium 1.0.2 2018-12-13 [1] CRAN (R 3.5.0) storr 1.2.1 2018-10-18 [1] CRAN (R 3.5.0) stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2) tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0) triebeard 0.3.0 2016-08-04 [1] CRAN (R 3.5.0) urltools 1.7.2 2019-02-04 [1] CRAN (R 3.5.2) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) whisker 0.3-2 2013-04-28 [1] CRAN (R 3.5.0) withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0) V XML 3.98-1.19 2019-02-08 [1] CRAN (R 3.5.2) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.0) yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0) [1] /Users/lallenjacobson-mac/Library/R/3.5/library [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library V ── Loaded and 
on-disk version mismatch. ```

I use the progress bar, but now it takes a long time for the progress bar to appear (~60 min). When the function was working, this was not the case. When ft_get() does finish, it only retrieves ~3% of the full-text articles; in this case, most of the DOIs are not recognized and are assigned as unknown. Most recently, the function failed and returned the following error:

Error
```r
|=================================================================================================| 100%
|================ | 17%
Error in (if (compress) gzfile else file)(tmp, "wb") :
  cannot open the connection
In addition: Warning messages:
1: no plugin for Crossref member '1121' yet
2: no plugin for Crossref member '155' yet
3: In (if (compress) gzfile else file)(tmp, "wb") :
  cannot open compressed file '/Users/lallenjacobson-mac/Library/Caches/R/fulltext_storr/scratch/file1a1485f743d', probable reason 'No such file or directory'
```

To troubleshoot, I tried running the code on a different computer, and ft_get() works. I've also included the session details from that computer:

Session Info for new computer ```r ─ Session info ──────────────────────────────────────────────────────────────────────────── setting value version R version 3.4.2 (2017-09-28) os macOS High Sierra 10.13.6 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz date 2019-03-11 ─ Packages ──────────────────────────────────────────────────────────────────────────────── package * version date lib source aRxiv 0.5.16 2017-04-28 [1] CRAN (R 3.4.0) assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.4.0) backports 1.1.2 2017-12-13 [1] CRAN (R 3.4.3) bib2df * 1.0.1 2018-06-02 [1] CRAN (R 3.4.2) bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.4.1) callr 3.1.1 2018-12-21 [1] CRAN (R 3.4.4) cli 1.0.1 2018-09-25 [1] CRAN (R 3.4.4) colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.4.4) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.1) crminer 0.2.0 2018-10-15 [1] CRAN (R 3.4.4) crul 0.7.0 2019-01-04 [1] CRAN (R 3.4.4) curl 3.3 2019-01-10 [1] CRAN (R 3.4.4) data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.4.4) desc 1.2.0 2018-05-01 [1] CRAN (R 3.4.4) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.4.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.4.4) dplyr 0.8.0.1 2019-02-15 [1] CRAN (R 3.4.4) DT 0.5 2018-11-05 [1] CRAN (R 3.4.4) evaluate 0.10.1 2017-06-24 [1] CRAN (R 3.4.1) fs 1.2.6 2018-08-23 [1] CRAN (R 3.4.4) fulltext * 1.2.0 2019-01-22 [1] CRAN (R 3.4.4) ggplot2 3.1.0 2018-10-25 [1] CRAN (R 3.4.4) glue 1.3.0 2018-07-17 [1] CRAN (R 3.4.4) gtable 0.2.0 2016-02-26 [1] CRAN (R 3.4.0) hoardr 0.5.2 2018-12-02 [1] CRAN (R 3.4.4) htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.4.0) htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.4.4) httpcode 0.2.0 2016-11-14 [1] CRAN (R 3.4.0) httpuv 1.4.5.1 2018-12-18 [1] CRAN (R 3.4.4) httr 1.4.0 2018-12-11 [1] CRAN (R 3.4.4) humaniformat 0.6.0 2016-04-24 [1] CRAN (R 3.4.0) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.4.4) knitr 1.20 2018-02-20 [1] CRAN (R 3.4.3) later 0.8.0 2019-02-11 [1] CRAN (R 3.4.4) lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.4.2) lubridate 1.7.4 2018-04-11 
[1] CRAN (R 3.4.4) magrittr 1.5 2014-11-22 [1] CRAN (R 3.4.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.4.0) microdemic 0.4.0 2018-10-25 [1] CRAN (R 3.4.4) mime 0.6 2018-10-05 [1] CRAN (R 3.4.4) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.4.4) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.4.4) pdftools 2.1 2019-01-16 [1] CRAN (R 3.4.4) pillar 1.3.1 2018-12-15 [1] CRAN (R 3.4.4) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.4.4) pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.4.4) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.4.4) plyr 1.8.4 2016-06-08 [1] CRAN (R 3.4.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.4.0) processx 3.2.1 2018-12-05 [1] CRAN (R 3.4.4) promises 1.0.1 2018-04-13 [1] CRAN (R 3.4.4) ps 1.3.0 2018-12-21 [1] CRAN (R 3.4.4) purrr * 0.3.1 2019-03-03 [1] CRAN (R 3.4.4) R6 2.4.0 2019-02-14 [1] CRAN (R 3.4.4) rappdirs 0.3.1 2016-03-28 [1] CRAN (R 3.4.0) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.4.4) rcrossref 0.9.0 2019-01-14 [1] CRAN (R 3.4.4) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.4.4) rentrez 1.2.1 2018-03-05 [1] CRAN (R 3.4.4) reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.4.3) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.4.4) rplos 0.8.4 2018-08-14 [1] CRAN (R 3.4.4) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.3) rstudioapi 0.7 2017-09-07 [1] CRAN (R 3.4.1) scales 1.0.0 2018-08-09 [1] CRAN (R 3.4.4) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.4.4) shiny 1.2.0 2018-11-02 [1] CRAN (R 3.4.4) solrium 1.0.2 2018-12-13 [1] CRAN (R 3.4.4) storr 1.2.1 2018-10-18 [1] CRAN (R 3.4.4) stringi 1.3.1 2019-02-13 [1] CRAN (R 3.4.4) stringr 1.4.0 2019-02-10 [1] CRAN (R 3.4.4) tibble 2.0.1 2019-01-12 [1] CRAN (R 3.4.4) tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.4.4) triebeard 0.3.0 2016-08-04 [1] CRAN (R 3.4.0) urltools 1.7.2 2019-02-04 [1] CRAN (R 3.4.4) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.4.4) whisker 0.3-2 2013-04-28 [1] CRAN (R 3.4.0) withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4) XML 3.98-1.11 2018-04-16 [1] CRAN (R 3.4.4) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.4.3) xtable 1.8-3 2018-08-29 [1] CRAN (R 3.4.4) yaml 2.2.0 
2018-07-25 [1] CRAN (R 3.4.4) [1] /Library/Frameworks/R.framework/Versions/3.4/Resources/library ```

Do you have any troubleshooting tips?

Second, my preference is for ft_get() to return .xml, which I receive for about half of the results. However, these files do not include the full text. Am I missing something?

sckott commented 5 years ago

We don't depend on the XML package or the purrr package in this package, so I imagine those only affect the user code you showed above.

The error about "cannot open the connection" seems to me to suggest that you may not have full admin access on that computer. is that the case? You can set the cache to be stored elsewhere, try that and see if you get the same result.
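
Before changing the cache location, it may help to confirm that the directory named in the error message exists and can be written to from R. A minimal base-R sketch follows; `can_write_to()` is a hypothetical helper, and `tempdir()` stands in for the real cache path.

```r
# Hypothetical helper: TRUE if `dir` exists (or can be created) and a file
# can be written inside it, FALSE otherwise.
can_write_to <- function(dir) {
  if (!dir.exists(dir)) {
    created <- dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    if (!created) return(FALSE)
  }
  probe <- tempfile(tmpdir = dir)
  ok <- tryCatch({
    writeLines("test", probe)
    TRUE
  }, error = function(e) FALSE, warning = function(w) FALSE)
  # Clean up the probe file if it was written.
  if (file.exists(probe)) unlink(probe)
  ok
}

# Replace tempdir() with the cache directory from the error message.
can_write_to(tempdir())
```

If this returns FALSE for the cache directory, the failure is likely a filesystem or permissions problem rather than anything specific to the DOIs being requested.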

The first computer has a newer version of OSX and an up-to-date R version, whereas the second computer has older OSX and R versions. If those were the same you could eliminate those potential issues.

ft_get() can only return xml content if the publisher provides it. Most publishers do not provide XML; most probably have the XML, but many of them choose not to make it available. If there are cases where you know the XML is available but you aren't getting it, then let me know.

LMAllenJacobson commented 5 years ago

You are right, I do not have admin access on my work computer. I just set the cache to a folder within my documents but received the same error. However, I do not think the error is related to admin privileges, because I originally had this working on my work computer.

I will update my personal computer and rerun this evening.

Regarding the XML, I understand that some publishers do not provide this format, but when XML is provided, should this file include the full text? I receive two types of documents: PDFs of the full text and XML with article metadata (links, title, publication). For example, see the attached (I could not attach a .xml, so I've changed the extension to .txt).

10_1016_0022_1910_85_90024_1.txt

sckott commented 5 years ago

the xml issue: it can happen with Elsevier that you get XML back and everything seems fine, but it's just metadata. This should be due to you not having access to the article.

LMAllenJacobson commented 5 years ago

I just checked two examples, and I had full-text access to both. Both are from Elsevier, one from 1962 and one from 1985. Maybe the issue is that these papers are older?

sckott commented 5 years ago

Checked the one that you shared above, ft_get("10.1016/0022-1910(85)90024-1") and that worked for me when on my VPN. got the full XML. If you share the other DOI I can check that one. It's entirely possible we have access to different subsets of their articles

Does the ft_get output give any errors in the $errors slot?

LMAllenJacobson commented 5 years ago

For this example:

LMAllenJacobson commented 5 years ago

Here is the DOI for the other example, from 1962: https://doi.org/10.1016/0010-406X(62)90031-2

LMAllenJacobson commented 5 years ago

I've updated R and packages on both computers. I haven't had the time to update the OS. If you think updating will help to solve the problem, I can do that later this week. Now, I get a new error, but it is the same for both computers:

New Error
```r
> bib_ft <- ft_get(bib_df[,DOI], progress = TRUE)
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
Error in sprintf(pat, x) : too few arguments
In addition: There were 24 warnings (use warnings() to see them)
```
Session Info for original computer ```r > devtools::session_info() ─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS Mojave 10.14.3 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2019-03-12 ─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source aRxiv 0.5.16 2017-04-28 [1] CRAN (R 3.5.0) askpass 1.1 2019-01-13 [1] CRAN (R 3.5.2) assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) bib2df * 1.0.1 2018-06-02 [1] CRAN (R 3.5.2) bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) crminer 0.2.0 2018-10-15 [1] CRAN (R 3.5.0) crul 0.7.0 2019-01-04 [1] CRAN (R 3.5.2) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2) DT 0.5 2018-11-05 [1] CRAN (R 3.5.0) fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) fulltext * 1.2.0 2019-01-22 [1] CRAN (R 3.5.2) ggplot2 3.1.0 2018-10-25 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0) hoardr 0.5.2 2018-12-02 [1] CRAN (R 3.5.0) htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0) htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.0) httpcode 0.2.0 2016-11-14 [1] CRAN (R 3.5.0) httpuv 1.4.5.1 2018-12-18 [1] CRAN (R 3.5.0) httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.0) humaniformat 0.6.0 2016-04-24 [1] CRAN (R 3.5.0) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0) later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2) 
lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0) lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0) magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) microdemic 0.4.0 2018-10-25 [1] CRAN (R 3.5.0) mime 0.6 2018-10-05 [1] CRAN (R 3.5.0) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.5.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0) pdftools 2.2 2019-03-10 [1] CRAN (R 3.5.2) pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2) promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) purrr * 0.3.1 2019-03-03 [1] CRAN (R 3.5.2) qpdf 1.1 2019-03-07 [1] CRAN (R 3.5.2) R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2) rappdirs 0.3.1 2016-03-28 [1] CRAN (R 3.5.0) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) rcrossref 0.9.0 2019-01-14 [1] CRAN (R 3.5.2) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rentrez 1.2.1 2018-03-05 [1] CRAN (R 3.5.0) reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rplos 0.8.4 2018-08-14 [1] CRAN (R 3.5.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2) scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.0) solrium 1.0.2 2018-12-13 [1] CRAN (R 3.5.0) storr 1.2.1 2018-10-18 [1] CRAN (R 3.5.0) stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2) tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0) triebeard 0.3.0 2016-08-04 [1] CRAN (R 3.5.0) urltools 1.7.2 2019-02-04 [1] CRAN (R 3.5.2) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) whisker 0.3-2 2013-04-28 [1] CRAN (R 3.5.0) withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0) XML 3.98-1.19 2019-03-06 
[1] CRAN (R 3.5.2) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.0) yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0) [1] /Users/lallenjacobson-mac/Library/R/3.5/library [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library ```
Session Info for second computer ```r > devtools::session_info() ─ Session info ──────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS High Sierra 10.13.6 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2019-03-12 ─ Packages ──────────────────────────────────────────────────────────────── package * version date lib source aRxiv 0.5.16 2017-04-28 [1] CRAN (R 3.5.0) askpass 1.1 2019-01-13 [1] CRAN (R 3.5.2) assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) bib2df * 1.0.1 2018-06-02 [1] CRAN (R 3.5.2) bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) crminer 0.2.0 2018-10-15 [1] CRAN (R 3.5.0) crul 0.7.0 2019-01-04 [1] CRAN (R 3.5.2) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) dplyr 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2) DT 0.5 2018-11-05 [1] CRAN (R 3.5.0) fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) fulltext * 1.2.0 2019-01-22 [1] CRAN (R 3.5.2) ggplot2 3.1.0 2018-10-25 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0) hoardr 0.5.2 2018-12-02 [1] CRAN (R 3.5.0) htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0) htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.0) httpcode 0.2.0 2016-11-14 [1] CRAN (R 3.5.0) httpuv 1.4.5.1 2018-12-18 [1] CRAN (R 3.5.0) httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.0) humaniformat 0.6.0 2016-04-24 [1] CRAN (R 3.5.0) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0) later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2) lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0) lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0) magrittr 1.5 
2014-11-22 [1] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) microdemic 0.4.0 2018-10-25 [1] CRAN (R 3.5.0) mime 0.6 2018-10-05 [1] CRAN (R 3.5.0) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.5.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0) pdftools 2.2 2019-03-10 [1] CRAN (R 3.5.2) pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2) promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) purrr * 0.3.1 2019-03-03 [1] CRAN (R 3.5.2) qpdf 1.1 2019-03-07 [1] CRAN (R 3.5.2) R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2) rappdirs 0.3.1 2016-03-28 [1] CRAN (R 3.5.0) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) rcrossref 0.9.0 2019-01-14 [1] CRAN (R 3.5.2) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rentrez 1.2.1 2018-03-05 [1] CRAN (R 3.5.0) reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rplos 0.8.4 2018-08-14 [1] CRAN (R 3.5.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.0) solrium 1.0.2 2018-12-13 [1] CRAN (R 3.5.0) storr 1.2.1 2018-10-18 [1] CRAN (R 3.5.0) stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2) tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0) triebeard 0.3.0 2016-08-04 [1] CRAN (R 3.5.0) urltools 1.7.2 2019-02-04 [1] CRAN (R 3.5.2) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) whisker 0.3-2 2013-04-28 [1] CRAN (R 3.5.0) withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0) XML 3.98-1.19 2019-03-06 [1] CRAN (R 3.5.2) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.0) [1] 
/Library/Frameworks/R.framework/Versions/3.5/Resources/library ```
sckott commented 5 years ago

The 2 DOIs you aren't getting full text for might be an Elsevier "fence" issue. What institution is your access through?

LMAllenJacobson commented 5 years ago

The University of Florida

sckott commented 5 years ago

email sent

LMAllenJacobson commented 5 years ago

Thanks! Is that something I could do if I change institutions?

LMAllenJacobson commented 5 years ago

Regarding the "Error in sprintf(pat, x) : too few arguments": with some help, I've determined that one reference from the Journal of Experimental Biology triggers this error. If I remove that one reference, ft_get() runs.

Remove problem reference and retrieve full texts

bib_ft <- ft_get(bib_df[-c(64), DOI], progress = TRUE, try_unknown = TRUE) 
(ref64 <- bib_df[c(64), DOI])
[1] "10.1242/jeb.00394"
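
For anyone hitting the same thing, here is a minimal sketch of one way to find which DOI breaks a batch: call the fetcher on each DOI separately and record the error message instead of stopping. `find_failing()` and `fake_fetch()` are hypothetical names introduced for illustration; `fake_fetch()` is only a stand-in for ft_get() so the example runs offline.

```r
# Hypothetical helper: call `fetch` on each id separately and record any error
# message, so that a single bad id cannot abort the whole batch.
find_failing <- function(ids, fetch) {
  msgs <- vapply(ids, function(id) {
    tryCatch({
      fetch(id)
      NA_character_            # no error for this id
    }, error = function(e) conditionMessage(e))
  }, character(1))
  data.frame(id = ids, error = unname(msgs), stringsAsFactors = FALSE)
}

# Stand-in for ft_get() that fails on one DOI (illustration only).
fake_fetch <- function(doi) {
  if (doi == "10.1242/jeb.00394") stop("too few arguments")
  invisible(doi)
}

dois <- c("10.1007/bf00344853", "10.1242/jeb.00394")
res <- find_failing(dois, fake_fetch)

# Rows with a non-missing error are the problem references.
res[!is.na(res$error), ]
```

With the real package, `fetch` could be something like `function(d) ft_get(d, try_unknown = TRUE)`, at the cost of one call per DOI.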

build summary table

retrieved <- stack(map(bib_ft, 1))
dois_queried <- map(bib_ft, 2)
queried <- stack(lapply(dois_queried, function(x) length(x)))
summary <- merge(x = retrieved, y = queried, all=TRUE, by = "ind")
setnames(summary, old=c("values.x","values.y"), new=c("retrieved", "queried"))
source retrieved queried
aaas 0 2
american_physiological_society 2 2
brill 0 1
cambridge_university_press_cup 0 1
canadian_science_publishing 4 7
elsevier 36 36
inter_research_science_center 3 3
jstor 0 10
oxford_university_press_oup 0 8
springer_nature 20 20
the_royal_society 2 2
university_of_california_press 0 1
university_of_chicago_press 5 20
wiley 0 20

notes:

  1. I was able to retrieve both articles from the Royal Society.

  2. Of the retrieved articles, 36 are XML, and all of the XMLs include only metadata. That matches the number of articles retrieved from Elsevier, although I haven't confirmed that all 36 are from Elsevier. Hopefully I will retrieve full-text XMLs once the fence is removed.
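
The retrieved/queried summary can also be sketched in base R, without purrr or data.table. This assumes each publisher element of the ft_get() result carries a `found` count and a `dois` vector, which is what the positional `map(bib_ft, 1)` and `map(bib_ft, 2)` calls appear to pick out; `mock_ft`, its placeholder DOIs, and `summarise_ft()` are made up for illustration.

```r
# Mock object shaped like the assumed ft_get() result: one element per
# publisher, each with `found` (number retrieved) and `dois` (DOIs queried).
# The DOIs are placeholders, not real articles.
mock_ft <- list(
  elsevier = list(found = 2L, dois = c("10.1016/placeholder-a", "10.1016/placeholder-b")),
  wiley    = list(found = 0L, dois = c("10.1111/placeholder-x", "10.1111/placeholder-y",
                                       "10.1111/placeholder-z"))
)

# Hypothetical helper: one row per publisher with retrieved and queried counts.
summarise_ft <- function(x) {
  data.frame(
    source    = names(x),
    # Treat a missing `found` as zero retrieved.
    retrieved = unname(vapply(x, function(p) {
      if (is.null(p$found)) 0L else as.integer(p$found)
    }, integer(1))),
    queried   = unname(vapply(x, function(p) length(p$dois), integer(1))),
    stringsAsFactors = FALSE
  )
}

tab <- summarise_ft(mock_ft)
tab$missing <- tab$queried - tab$retrieved  # articles still to chase up
tab
```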

report warnings

warnings <- bind_rows(lapply(bib_ft, "[[", "errors"), .id = "publisher") %>% 
  filter(!is.na(error))
Warnings

journal|DOI|warning
------------------------------|----|----
jstor|10.2307/3543214|no link found from Crossref
jstor|10.2307/1563678|no link found from Crossref
jstor|10.2307/2389690|no link found from Crossref
jstor|10.2307/1444056|no link found from Crossref
jstor|10.2307/3543971|no link found from Crossref
jstor|10.2307/2976|no link found from Crossref
jstor|10.2307/3544118|no link found from Crossref
jstor|10.2307/2390277|no link found from Crossref
jstor|10.2307/1446195|no link found from Crossref
jstor|10.2307/1446195|no link found from Crossref
canadian_science_publishing|10.1139/f73-068|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
canadian_science_publishing|10.1139/f71-253|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
canadian_science_publishing|10.1139/f72-270|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.63.3.30156228|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.45.1.30155926|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/515917|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.55.2.30155850|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.66.1.30158293|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.37.4.30152756|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.39.1.30152763|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/639616|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.46.4.30155609|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.52.1.30159931|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.32.1.30152287|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/639605|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.17.1.30151829|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.41.4.30155477|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
university_of_chicago_press|10.1086/physzool.63.6.30152639|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
aaas|10.1126/science.134.3495.2033|type was supposed to be `pdf`, but was `text/html; charset=utf-8`
aaas|10.1126/science.66.1709.289|type was supposed to be `pdf`, but was `text/html; charset=utf-8`
oxford_university_press__oup_|10.1093/jn/5.6.581|Recv failure: Operation timed out
oxford_university_press__oup_|10.1093/jn/121.suppl_11.s18|Recv failure: Operation timed out
oxford_university_press__oup_|10.1093/jn/121.suppl_11.s8|Recv failure: Connection reset by peer
oxford_university_press__oup_|10.1093/jn/3.2.177|Recv failure: Operation timed out
oxford_university_press__oup_|10.1093/jn/8.2.139|Recv failure: Operation timed out
oxford_university_press__oup_|10.1093/jn/18.5.473|Recv failure: Operation timed out
oxford_university_press__oup_|10.1093/icb/9.2.418|Recv failure: Operation timed out
oxford_university_press__oup_|10.2527/jas1976.433692x|Recv failure: Operation timed out
springer_nature|10.1007/s003600050220|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00344853|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf01875448|(503) Service Unavailable
springer_nature|10.1007/bf00379996|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00684448|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.2307/1350546|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00297958|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00346410|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00386903|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00592305|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00346295|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00345740|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
springer_nature|10.1007/bf00738417|type was supposed to be `pdf`, but was `text/html; charset=UTF-8`
wiley|10.1111/j.1365-2427.1982.tb00620.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.4319/lo.1977.22.1.0108|type was supposed to be `pdf`, but was `text/plain`
wiley|10.1002/jez.1402500215|wrong args for environment subassignment
wiley|10.2307/1937827|wrong args for environment subassignment
wiley|10.1111/j.1095-8649.1978.tb03426.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.1111/j.1095-8649.1985.tb04017.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.1111/j.1744-7348.1958.tb02226.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.2307/1935183|wrong args for environment subassignment
wiley|10.4319/lo.1991.36.2.0354|type was supposed to be `pdf`, but was `text/plain`
wiley|10.1111/j.1365-3032.1997.tb01176.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.1111/j.1095-8649.2000.tb00272.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.1046/j.1095-8649.2003.00048.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.2307/1942479|wrong args for environment subassignment
wiley|10.2307/1933575|wrong args for environment subassignment
wiley|10.1111/j.1095-8649.1991.tb03136.x|type was supposed to be `pdf`, but was `text/plain`
wiley|10.4319/lo.1971.16.1.0086|type was supposed to be `pdf`, but was `text/plain`
wiley|10.2307/1936532|wrong args for environment subassignment
wiley|10.4319/lo.1978.23.3.0461|type was supposed to be `pdf`, but was `text/plain`
wiley|10.2307/1933448|wrong args for environment subassignment
wiley|10.1111/j.1095-8649.2004.00374.x|type was supposed to be `pdf`, but was `text/plain`
university_of_california_press|10.2307/4444807|no link found from Crossref
brill|10.1163/187529274x00591|type was supposed to be `pdf`, but was `text/html`
cambridge_university_press__cup_|10.1079/bjn19720046|no link found from Crossref
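
To triage a warnings table like this one, the raw messages can be grouped into a few coarse categories and tallied per publisher. The sketch below uses four rows copied from the table; `classify_error()` and its category labels are hypothetical names, not part of fulltext.

```r
# A few rows copied from the warnings table above.
warn <- data.frame(
  publisher = c("jstor", "wiley", "wiley", "oxford_university_press__oup_"),
  error = c(
    "no link found from Crossref",
    "type was supposed to be `pdf`, but was `text/plain`",
    "wrong args for environment subassignment",
    "Recv failure: Operation timed out"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical classifier: map each raw message to a coarse category.
classify_error <- function(msg) {
  ifelse(grepl("no link found", msg, fixed = TRUE), "no_link",
  ifelse(grepl("type was supposed to be", msg, fixed = TRUE), "wrong_content_type",
  ifelse(grepl("Recv failure", msg, fixed = TRUE), "network", "other")))
}

warn$category <- classify_error(warn$error)

# Cross-tabulate publishers against failure categories.
table(warn$publisher, warn$category)
```

Splitting the failures this way separates the ones that might clear up on a retry (network) from the ones that probably need a different link or access route.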
Session Info ```r > devtools::session_info() ─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS High Sierra 10.13.6 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2019-03-12 ─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source aRxiv 0.5.16 2017-04-28 [1] CRAN (R 3.5.0) assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) bib2df * 1.0.1 2018-06-02 [1] CRAN (R 3.5.2) bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) crul 0.7.0 2019-01-04 [1] CRAN (R 3.5.2) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2) DT 0.5 2018-11-05 [1] CRAN (R 3.5.0) fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) fulltext * 1.2.0 2019-01-22 [1] CRAN (R 3.5.2) ggplot2 3.1.0 2018-10-25 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0) hoardr 0.5.2 2018-12-02 [1] CRAN (R 3.5.0) htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0) htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.0) httpcode 0.2.0 2016-11-14 [1] CRAN (R 3.5.0) httpuv 1.4.5.1 2018-12-18 [1] CRAN (R 3.5.0) httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.0) humaniformat 0.6.0 2016-04-24 [1] CRAN (R 3.5.0) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0) knitr 1.22 2019-03-08 [1] CRAN (R 3.5.2) later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2) lazyeval 0.2.1 
2017-10-29 [1] CRAN (R 3.5.0) lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0) magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) microdemic 0.4.0 2018-10-25 [1] CRAN (R 3.5.0) mime 0.6 2018-10-05 [1] CRAN (R 3.5.0) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.5.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0) pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2) promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) purrr * 0.3.1 2019-03-03 [1] CRAN (R 3.5.2) R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2) rappdirs 0.3.1 2016-03-28 [1] CRAN (R 3.5.0) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) rcrossref 0.9.0 2019-01-14 [1] CRAN (R 3.5.2) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rentrez 1.2.1 2018-03-05 [1] CRAN (R 3.5.0) reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rplos 0.8.4 2018-08-14 [1] CRAN (R 3.5.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2) scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.0) solrium 1.0.2 2018-12-13 [1] CRAN (R 3.5.0) storr 1.2.1 2018-10-18 [1] CRAN (R 3.5.0) stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2) tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) whisker 0.3-2 2013-04-28 [1] CRAN (R 3.5.0) withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0) xfun 0.5 2019-02-20 [1] CRAN (R 3.5.2) XML 3.98-1.19 2019-03-06 [1] CRAN (R 3.5.2) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.0) yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0) [1] 
/Library/Frameworks/R.framework/Versions/3.5/Resources/library ```
sckott commented 5 years ago

Regarding the fence issue: they replied, but they want to see the API key to help diagnose. You could email it to me (don't share it here), or I could include you in the thread and then you can share it with just them. Up to you.

sckott commented 5 years ago

> Is that something I could do if I change institutions?

what is "that"?

LMAllenJacobson commented 5 years ago

If I change institutions, can I fix the fence issue on my own (maybe by contacting my new library)? Or should I create a new issue at that time?

LMAllenJacobson commented 5 years ago

Sorry I never answered your earlier question regarding Wiley and my Crossref TDM key. I think I have it set up correctly. I logged in to the Crossref click-through service using my ORCID iD, accepted the 3 publisher-specific agreements (Zhejiang, Elsevier, and Wiley), copied my API token, and stored my token and my email in my R environment:

crossref_email <-  "institutional e.mail"
CROSSREF_TDM <- "cross ref api token"

Could a "fence" prevent access to Wiley?
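
On the token setup above: the two lines as written create ordinary R objects. My understanding, which is an assumption worth checking against the package docs, is that rcrossref and fulltext look up environment variables named `crossref_email` and `CROSSREF_TDM` instead. A minimal sketch of setting and checking them follows; the values are placeholders and `env_is_set()` is a hypothetical helper.

```r
# Placeholder values: substitute your own email and token, and keep the real
# token out of any script you share.
Sys.setenv(
  crossref_email = "you@example.org",
  CROSSREF_TDM   = "your-crossref-tdm-token"
)

# Hypothetical helper: TRUE if the named environment variable is non-empty.
env_is_set <- function(name) nzchar(Sys.getenv(name))

# Check both variables without printing their values.
vapply(c("crossref_email", "CROSSREF_TDM"), env_is_set, logical(1))
```

Putting the same name=value pairs in `~/.Renviron` would make them persist across sessions.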

sckott commented 5 years ago

On vacation. Back Monday

sckott commented 5 years ago

> if I change institutions, can I fix the fence issue on my own (maybe by contacting my new library)? Or, should I create a new issue at that time?

You can definitely try. I email integrationsupport@elsevier.com to ask about issues with their articles/APIs. You could email them yourself.

As far as I know, fences are only an Elsevier thing. I'm not sure what's preventing Wiley access in this case.

joshuachristie commented 4 years ago

I'm having a similar problem with Wiley (not able to download any PDFs). In my case, it seems that ft_get is not using the API link (i.e. it tries the https://onlinelibrary.wiley.com/ link, not the https://api.wiley.com/onlinelibrary/ link). For example, for DOI 10.1111/evo.13812, I get the following PDF links from the Crossref API:

index|URL|content-type|content-version|intended-application
----|----|----|----|----
0|https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1111%2Fevo.13812|application/pdf|vor|text-mining
1|https://onlinelibrary.wiley.com/doi/pdf/10.1111/evo.13812|application/pdf|vor|text-mining

but when running ft_get(doi, verbose = TRUE) it goes to

< Location: https://onlinelibrary.wiley.com/doi/pdf/10.1111/evo.13812?cookieSet=1

(which it tries multiple times, but always this link and not the api one).

While I can access that link directly in the browser, I can't access it via curl, wget, etc. I can, however, download the PDF fine using curl if I use the API link. My Crossref TDM token is set correctly in ft_get (it shows up as CR-Clickthrough-Client-Token, and I can download Elsevier papers fine).
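
To make the suggestion concrete, here is a rough sketch of preferring the api.wiley.com link when Crossref returns several. This is not how fulltext selects links internally; `pick_link()` is a hypothetical helper, and the two records are the ones quoted above.

```r
# The two link records from the Crossref response quoted above.
links <- list(
  list(URL = "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1111%2Fevo.13812",
       `content-type` = "application/pdf"),
  list(URL = "https://onlinelibrary.wiley.com/doi/pdf/10.1111/evo.13812",
       `content-type` = "application/pdf")
)

# Hypothetical helper: return the first link on the preferred host, or fall
# back to the first link overall when no host matches.
pick_link <- function(links, preferred_host = "api.wiley.com") {
  urls  <- vapply(links, function(l) l$URL, character(1))
  hosts <- sub("^https?://([^/]+).*$", "\\1", urls)  # keep only the host part
  hit   <- which(hosts == preferred_host)
  if (length(hit) > 0) urls[hit[1]] else urls[1]
}

pick_link(links)
```

The choice does not depend on the order in which Crossref lists the links, which is the property that matters here.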

sckott commented 4 years ago

thanks, will have a look

sckott commented 4 years ago

@joshuachristie try again after reinstalling with remotes::install_github("ropensci/fulltext"). Wiley is now giving those api.wiley.com URLs; as far as I know they didn't use to give those URLs, only the onlinelibrary ones. We're now using the api.wiley.com URLs.

joshuachristie commented 4 years ago

Yep working now - thanks so much for the quick fix!

sckott commented 4 years ago

Oh, and you may have noticed: Wiley now has XML, so by default you get XML unless you ask for PDF.