ropensci-archive / fulltext

:warning: ARCHIVED :warning: Search across and get full text for OA & closed journals

OA Wiley ePDF not downloading #189

Closed bomeara closed 5 years ago

bomeara commented 5 years ago

I'm trying to get an open access article from a Wiley journal. I have tried this both on a laptop at home and on a desktop at work (the latter has access to paywalled articles, too).

fulltext::ft_get("10.3732/AJB.1700190")

and the return is

<fulltext text>
[Docs] 0
[Source] ext - /Users/bomeara/Library/Caches/R/fulltext
[IDs] 10.3732/AJB.1700190 ...
Warning message:
you may not have access to 10.3732/AJB.1700190
 or an error occurred
 or the downloaded file was invalid

The article is open access: https://bsapubs.onlinelibrary.wiley.com/doi/full/10.3732/ajb.1700190

but the PDF link goes to an epdf: https://bsapubs.onlinelibrary.wiley.com/doi/epdf/10.3732/ajb.1700190

However, dropping the "e" from the URL brings us to a nice PDF: https://bsapubs.onlinelibrary.wiley.com/doi/pdf/10.3732/ajb.1700190

Session Info

```r
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.5.2 (2018-12-20)
 os       macOS Mojave 10.14
 system   x86_64, darwin15.6.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2019-01-16

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date       lib source
 aRxiv         0.5.16     2017-04-28 [1] CRAN (R 3.5.0)
 assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)
 backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.0)
 bibtex        0.4.2      2017-06-30 [1] CRAN (R 3.5.0)
 bindr         0.1.1      2018-03-13 [1] CRAN (R 3.5.0)
 bindrcpp      0.2.2      2018-03-29 [1] CRAN (R 3.5.0)
 callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.0)
 cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.0)
 colorspace    1.4-0      2019-01-13 [1] CRAN (R 3.5.2)
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)
 crul          0.7.0      2019-01-04 [1] CRAN (R 3.5.2)
 curl          3.3        2019-01-10 [1] CRAN (R 3.5.2)
 desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)
 devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)
 digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)
 dplyr         0.7.8      2018-11-10 [1] CRAN (R 3.5.0)
 DT            0.5        2018-11-05 [1] CRAN (R 3.5.0)
 fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.1)
 fulltext    * 1.1.0.9233 2019-01-17 [1] Github (ropensci/fulltext@7292954)
 ggplot2       3.1.0      2018-10-25 [1] CRAN (R 3.5.0)
 glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.0)
 gtable        0.2.0      2016-02-26 [1] CRAN (R 3.5.0)
 hoardr        0.5.2      2018-12-02 [1] CRAN (R 3.5.0)
 htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)
 htmlwidgets   1.3        2018-09-30 [1] CRAN (R 3.5.0)
 httpcode      0.2.0      2016-11-14 [1] CRAN (R 3.5.0)
 httpuv        1.4.5.1    2018-12-18 [1] CRAN (R 3.5.0)
 jsonlite      1.6        2018-12-07 [1] CRAN (R 3.5.1)
 later         0.7.5      2018-09-18 [1] CRAN (R 3.5.0)
 lazyeval      0.2.1      2017-10-29 [1] CRAN (R 3.5.0)
 lubridate     1.7.4      2018-04-11 [1] CRAN (R 3.5.0)
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)
 memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)
 microdemic    0.4.0      2018-10-25 [1] CRAN (R 3.5.0)
 mime          0.6        2018-10-05 [1] CRAN (R 3.5.0)
 miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 3.5.0)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 3.5.0)
 pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.0)
 pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)
 pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)
 pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)
 plyr          1.8.4      2016-06-08 [1] CRAN (R 3.5.0)
 prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)
 processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.0)
 promises      1.0.1      2018-04-13 [1] CRAN (R 3.5.0)
 ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)
 purrr         0.2.5      2018-05-29 [1] CRAN (R 3.5.0)
 R6            2.3.0      2018-10-04 [1] CRAN (R 3.5.0)
 rappdirs      0.3.1      2016-03-28 [1] CRAN (R 3.5.0)
 Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.0)
 rcrossref     0.9.0      2019-01-14 [1] CRAN (R 3.5.2)
 remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.0)
 rentrez       1.2.1      2018-03-05 [1] CRAN (R 3.5.0)
 reshape2      1.4.3      2017-12-11 [1] CRAN (R 3.5.0)
 rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)
 rplos         0.8.4      2018-08-14 [1] CRAN (R 3.5.0)
 rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)
 scales        1.0.0      2018-08-09 [1] CRAN (R 3.5.0)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)
 shiny         1.2.0      2018-11-02 [1] CRAN (R 3.5.0)
 solrium       1.0.2      2018-12-13 [1] CRAN (R 3.5.0)
 storr         1.2.1      2018-10-18 [1] CRAN (R 3.5.1)
 stringi       1.2.4      2018-07-20 [1] CRAN (R 3.5.0)
 stringr       1.3.1      2018-05-10 [1] CRAN (R 3.5.0)
 testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.0)
 tibble        2.0.1      2019-01-12 [1] CRAN (R 3.5.2)
 tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)
 triebeard     0.3.0      2016-08-04 [1] CRAN (R 3.5.0)
 urltools      1.7.1      2018-08-03 [1] CRAN (R 3.5.0)
 usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.0)
 whisker       0.3-2      2013-04-28 [1] CRAN (R 3.5.0)
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)
 XML           3.98-1.16  2018-08-19 [1] CRAN (R 3.5.0)
 xml2          1.2.0      2018-01-24 [1] CRAN (R 3.5.0)
 xtable        1.8-3      2018-08-29 [1] CRAN (R 3.5.0)

[1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
```
bomeara commented 5 years ago

Note: on digging down to here in the code, I thought it might be my lack of a CROSSREF_TDM token (after reading the fine manual on this), so I set that, but the problem persists.
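
(For anyone else landing here, a minimal sketch of setting that token for the session; fulltext reads it via Sys.getenv("CROSSREF_TDM"), as in the code further down this thread. Adding a `CROSSREF_TDM=<your token>` line to ~/.Renviron makes it persistent.)

Sys.setenv(CROSSREF_TDM = "<your token>")  # set for this session only
Sys.getenv("CROSSREF_TDM")                 # confirm it is visible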

bomeara commented 5 years ago

@rossmounce's blog post on avoiding ReadCube to get PDFs directly might be relevant for this problem, too (though CR-Clickthrough-Client-Token in the header argument of ft_get() seems like it should do this, but I really don't know what I'm doing).

sckott commented 5 years ago

thanks for the report @bomeara

I do get the same thing.

When running it without my UC Berkeley VPN on, you can dig into the result and see:

x <- ft_get("10.3732/AJB.1700190")
x$wiley$errors
#>                     id                                                             error
#>  1 10.3732/AJB.1700190 type was supposed to be `pdf`, but was `text/html; charset=UTF-8`

With VPN on I get the same thing.


If you use verbose=TRUE, i.e. ft_get("10.3732/AJB.1700190", verbose=TRUE) (verbose also lets you see whether your click-through token is being used and what content type [PDF, XML, HTML, plain text] is being requested), it takes you through a ton of redirects and finally ends up requesting this URL:

https://bsapubs.onlinelibrary.wiley.com/doi/full/10.3732/ajb.1700190

which is just the HTML page for the article. Helpful, right?
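
You can reproduce that last hop outside of fulltext; a minimal sketch with crul (assuming the response object exposes the final URL as $url and lowercased response header names):

library(crul)

# follow the DOI's redirect chain and inspect where it lands
cli <- HttpClient$new("https://doi.org/10.3732/ajb.1700190", opts = list(followlocation = 1))
res <- cli$get()
res$url                                 # final URL after all redirects
res$response_headers[["content-type"]]  # what actually came back (text/html here)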

Isn't text mining fun?

For Wiley we go to Crossref to get the full text link, but the link Wiley voluntarily gives to Crossref is for the HTML version, not the PDF (Wiley doesn't give XML even though they have it; nice, huh?).
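
You can see what Wiley deposits for yourself; a rough sketch with rcrossref (column names from memory, so check str(w$data) if yours differ):

library(rcrossref)

# the link column holds the full text URLs a publisher registers with Crossref
w <- cr_works(dois = "10.3732/ajb.1700190")
w$data$link[[1]]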

Anywho, this works; let me know if it works for you or not.

library(crul)
cli <- HttpClient$new(
  url = "https://bsapubs.onlinelibrary.wiley.com/doi/pdf/10.3732/ajb.1700190",
  headers = list(Accept = "application/pdf",
                 "CR-Clickthrough-Client-Token" = "<your token>"),
  opts = list(followlocation = 1)
)
f <- tempfile()
res <- cli$get(disk = f)
pdftools::pdf_text(f)

Now we need to see if we can follow the link redirects to end up at that URL that works ... will look into it more

bomeara commented 5 years ago

Yep, works beautifully (even stupidly including "CR-Clickthrough-Client-Token" = "<your token>" verbatim). Thank you.

I suppose a kludgy solution could be gsub("full", "pdf", url) for the path from Crossref, but that seems too much of a hack.
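
(If going that route, anchoring the substitution on the path segment would at least avoid touching a "full" that appears elsewhere in the URL; a sketch:)

url <- "https://bsapubs.onlinelibrary.wiley.com/doi/full/10.3732/ajb.1700190"
sub("/doi/full/", "/doi/pdf/", url, fixed = TRUE)
#> [1] "https://bsapubs.onlinelibrary.wiley.com/doi/pdf/10.3732/ajb.1700190"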

sckott commented 5 years ago

oh right, it probably shouldn't need the token since it's OA

sckott commented 5 years ago

that may be worth a try, but we can't expect Wiley to do anything in a consistent manner across all journals

bomeara commented 5 years ago

But I imagine there are other Wiley ePDFs where the token does matter [context: pulling the last 1000 articles from American Journal of Botany to put into Global Names / Phylotastic taxonomic name extraction].

sckott commented 5 years ago

> But I imagine there are other Wiley ePDFs where the token does matter

agree

bomeara commented 5 years ago

Sample of what's working now: the number of saved PDFs out of 1000 DOIs went from ~600 using CacheAllPDFsImmediately() to 985 using the crul workaround (all part of a longer drake workflow).

#' Download all, no looping
#'
#' They are all saved into data/pdfcache
#'
#' @param references.df data.frame from GetAllReferences
#' @return list of information
CacheAllPDFsImmediately <- function(references.df) {
  fulltext::cache_options_set(
    full_path = "/Users/bomeara/Documents/MyDocuments/GitClones/ReturnOfTheMinivan/data/pdfcache"
  )
  cache_all <- fulltext::ft_get(rev(references.df$DI))
  return(cache_all)
}

#' Hack to get the remaining ones
#'
#' Wiley sometimes wants to give ePDFs. Curse them.
#'
#' Thanks to Scott Chamberlain for the crul workaround
#'
#' @param cache_all output of CacheAllPDFsImmediately
#' @return Paths to all PDFs
CacheRemainingPDFs <- function(cache_all) {
  paths <- cache_all$wiley$data$path
  full_path <- "/Users/bomeara/Documents/MyDocuments/GitClones/ReturnOfTheMinivan/data/pdfcache"
  for (i in seq_along(paths)) {
    if (is.null(paths[[i]]$type)) {
      # build a filesystem-safe file name from the DOI
      output_file <- paste0(
        full_path, "/",
        gsub("\\.", "_", gsub("/", "_", paths[[i]]$id)), ".pdf"
      )
      if (!file.exists(output_file)) {
        print(output_file)
        cli <- crul::HttpClient$new(
          url = paste0("https://bsapubs.onlinelibrary.wiley.com/doi/pdf/", paths[[i]]$id),
          headers = list(Accept = "application/pdf",
                         "CR-Clickthrough-Client-Token" = Sys.getenv("CROSSREF_TDM")),
          opts = list(followlocation = 1)
        )
        try(res <- cli$get(disk = output_file))
      }
    }
  }
  return(list.files(path = full_path, full.names = TRUE))
}
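
For context, these slot into the drake plan roughly like this (a sketch; GetAllReferences is the upstream step from the roxygen above, and the target names here are made up):

plan <- drake::drake_plan(
  references = GetAllReferences(),
  cache_all  = CacheAllPDFsImmediately(references),
  pdf_paths  = CacheRemainingPDFs(cache_all)
)
drake::make(plan)
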
sckott commented 5 years ago

can you share those other 15 DOIs that fail?

sckott commented 5 years ago

okay, if you reinstall, that example should work now. It's sort of a hack: if we fail on the first attempt, we change the URL to the PDF version and try again, and if that fails we give up.
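
In sketch form (not the actual package code; crul lowercases response header names), the fallback amounts to something like:

library(crul)

# try the URL Crossref gives us; if the body that comes back isn't a PDF,
# rewrite /doi/full/ to /doi/pdf/ and try once more, then give up
fetch_pdf <- function(url, path, token = Sys.getenv("CROSSREF_TDM")) {
  try_once <- function(u) {
    cli <- HttpClient$new(
      url = u,
      headers = list(Accept = "application/pdf",
                     "CR-Clickthrough-Client-Token" = token),
      opts = list(followlocation = 1)
    )
    res <- cli$get(disk = path)
    ct <- res$response_headers[["content-type"]]
    res$success() && !is.null(ct) && grepl("pdf", ct, ignore.case = TRUE)
  }
  if (try_once(url)) return(path)
  if (try_once(sub("/doi/full/", "/doi/pdf/", url, fixed = TRUE))) return(path)
  NULL  # both attempts failed
}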

sckott commented 5 years ago

@bomeara let me know if the change works for you

bomeara commented 5 years ago

Thanks. The change got a few hundred more using fulltext (and the crul hack got nearly everything I need, so my urgency for this is much lower). But there were still some refs that it didn't get; here are the first 50:

10.3732/AJB.1300167
10.3732/AJB.1300241
10.3732/AJB.1300351
10.3732/AJB.1300223
10.3732/AJB.1300320
10.3732/AJB.1300053
10.3732/AJB.1300284
10.3732/AJB.1400172
10.3732/AJB.1400071
10.3732/AJB.1400224
10.3732/AJB.1400088
10.3732/AJB.1300388
10.3732/AJB.1400232
10.3732/AJB.1400267
10.3732/AJB.1400135
10.3732/AJB.1400120
10.3732/AJB.1400177
10.3732/AJB.1400156
10.3732/AJB.1400225
10.3732/AJB.1400290
10.3732/AJB.1400312
10.3732/AJB.1400198
10.3732/AJB.1400190
10.3732/AJB.1400262
10.3732/AJB.1400248
10.3732/AJB.1400050
10.3732/AJB.1400214
10.3732/AJB.1400125
10.3732/AJB.1400256
10.3732/AJB.1400252
10.3732/AJB.1400412
10.3732/AJB.1400422
10.3732/AJB.1400210
10.3732/AJB.1400264
10.3732/AJB.1400351
10.3732/AJB.1400036
10.3732/AJB.1400317
10.3732/AJB.1400377
10.3732/AJB.1400403
10.3732/AJB.1400484
10.3732/AJB.1400543
10.3732/AJB.1500990
10.3732/AJB.1400558
10.3732/AJB.1500004
10.3732/AJB.1400509
10.3732/AJB.1500095
10.3732/AJB.1500073
10.3732/AJB.1400431
10.3732/AJB.1500159
10.3732/AJB.1400466
sckott commented 5 years ago

thanks for the list

sckott commented 5 years ago

oh boy, those DOIs that don't work are throwing URLs like

https://syndication.highwire.org/content/doi/10.3732/ajb.1300053

that result in a 403: not authorized. Super helpful.
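
(A quick way to confirm from R, same crul pattern as earlier in the thread:)

library(crul)

# request one of the syndication URLs directly and look at the status code
res <- HttpClient$new("https://syndication.highwire.org/content/doi/10.3732/ajb.1300053")$get()
res$status_code
#> [1] 403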

sckott commented 5 years ago

Okay, if you reinstall, I think all those DOIs should work now. I tested the first 20 and they now work.

It's a nasty hack (https://github.com/ropensci/fulltext/commit/93b870acec051564163d8a89ddbcbebeced54ea5) but it at least works and I don't think it breaks anything.