cambridge full text issue

fangzhou-xie commented 4 years ago

Session Info

```r
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] crminer_0.3.3.93

loaded via a namespace (and not attached):
 [1] hoardr_0.5.2    compiler_3.6.3  R6_2.4.1        tools_3.6.3     httpcode_0.3.0  curl_4.3
 [7] rappdirs_0.3.1  Rcpp_1.0.4.6    urltools_1.7.3  pdftools_2.3    triebeard_0.3.0 crul_0.9.0
[13] qpdf_1.1        jsonlite_1.6.1  digest_0.6.25   askpass_1.1
```

> doi <- "10.1017/s0081305200012255"
> link <- crm_links(doi)
> crm_text(link)
Error in crm_text.list(link) : no links for type xml
> link
$unspecified
<url> https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S0081305200012255

> crm_text(link, "pdf", overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.

I think this is connected to #41 and #40?

sckott commented 4 years ago

If you go to the article page and try to get the PDF, it is somehow malformed. I don't know if it's just this article, or many articles in this journal or from this publisher.

So I don't think there's much we can do there, though we should fail more gracefully and remove the bad PDF file, since it stays on disk after the read failure.

sckott commented 4 years ago

Added some more error handling for this case: we now try to detect malformed PDFs. I'm not sure how robust the solution is until we run into more cases of malformed PDFs. The behavior with the latest commit:

doi <- "10.1017/s0081305200012255"
link <- crm_links(doi)
crm_text(link, type="pdf", overwrite_unspecified = TRUE)
#> Error: malformed pdf detected; contact publisher, see if they can fix
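
For illustration only, here is a rough sketch of one way such a check could work, using base R alone. It is not the code in crminer: `looks_like_pdf()` and `check_pdf()` are hypothetical helper names, and the heuristic (a `%PDF-` header at the start of the file and an `%%EOF` marker in the last 1024 bytes) is an assumption that may reject some unusual but valid PDFs.

```r
# Hypothetical helpers, not part of crminer.

# Cheap sanity check: does the file start with the "%PDF-" header and
# carry an "%%EOF" marker somewhere in its last 1024 bytes?
looks_like_pdf <- function(path) {
  size <- file.size(path)
  if (is.na(size) || size < 5) return(FALSE)
  bytes <- readBin(path, what = "raw", n = size)
  has_header <- identical(bytes[1:5], charToRaw("%PDF-"))
  tail_bytes <- utils::tail(bytes, 1024)
  # Drop NUL bytes so rawToChar() does not fail on binary content.
  tail_bytes <- tail_bytes[tail_bytes != as.raw(0)]
  has_eof <- grepl("%%EOF", rawToChar(tail_bytes),
                   fixed = TRUE, useBytes = TRUE)
  has_header && has_eof
}

# Remove the bad file and stop with a clear message, so a broken
# download does not linger on disk after the failure.
check_pdf <- function(path) {
  if (!looks_like_pdf(path)) {
    unlink(path)
    stop("malformed pdf detected; contact publisher, see if they can fix",
         call. = FALSE)
  }
  invisible(path)
}
```

Calling `check_pdf(path)` before the text extraction step would turn a low-level parser failure into a single readable error and delete the unusable file.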

ropensci-archive / crminer

cambridge full text issue #45