ropensci-archive / crminer

:no_entry: ARCHIVED :no_entry: Fetch 'Scholary' Full Text from 'Crossref'

oxford full text issue #41

Closed · fangzhou-xie closed this 4 years ago

fangzhou-xie commented 4 years ago
Session Info

```r
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3
```

Sorry for repeatedly posting issues. This time I am working on journals from Oxford University Press.

```r
> l <- crm_links("10.1093/icc/4.1.1-a")
> l
$unspecified
<url> http://academic.oup.com/icc/article-pdf/4/1/1/6768751/4-1-1b.pdf

> crm_text(l, "pdf", overwrite_unspecified = T)
Downloading pdf...
Error in curl::curl_fetch_disk(x$url$url, x$disk, handle = x$url$handle) :
  Recv failure: Operation timed out
```

I can confirm that I can open this link in a browser, but calling the crm_text() function throws a timeout error. I tried `curl -o` in the terminal and hit the same timeout.

I then tried to run an RSelenium browser and fetch that full-text link. It displayed the article (as a PDF) properly in the automated chromedriver.

```r
library(RSelenium)

browser <- remoteDriver(port = 5556, browserName = "chrome")
browser$open()
browser$navigate(as.character(l$unspecified))
```

I think their server does some JavaScript-based checking that plain curl-based HTTP requests fail. (I am not very familiar with this in R, but I guess it is the same limitation as Python's "requests" package: neither can deal with dynamically rendered elements.) I believe the current workaround would be to use RSelenium to download the PDF and then extract plain text from it, roughly as sketched below.
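A minimal sketch of that extraction step, assuming the Selenium-driven browser has already saved the PDF to disk; the path below is hypothetical, and pdftools stands in for whatever extractor you prefer:

```r
library(pdftools)

# Hypothetical path: wherever your chromedriver profile saves downloads.
pdf_path <- "~/Downloads/4-1-1b.pdf"

# pdf_text() returns one character string per page of the PDF.
txt <- pdftools::pdf_text(pdf_path)
cat(substr(txt[1], 1, 200))
```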

I wonder if there are better methods to deal with this without using Selenium?

sckott commented 4 years ago

no need to apologize, more issues are always better.

this appears to be a case of simply needing to pretend to be a browser. they appear to be looking for a browser-like user agent string. try

```r
ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
z <- crm_links("10.1093/icc/4.1.1-a")
crm_text(z, "pdf", overwrite_unspecified = TRUE, verbose = TRUE, useragent = ua)
```

that ua string is the one i copied from my browser's devtools when requesting the url for the article - i don't think the exact versions are important, i imagine they are using some kind of regex

fangzhou-xie commented 4 years ago

Thank you so much! However, I found that this solution works for a small number of articles but starts failing after a certain point. (As you may have noticed, I am fetching a lot of articles.)

(Screenshot, 2020-05-03 13:17:17: a captcha challenge page)

I can't find which link I used to take this screenshot, but it certainly happened for many articles. I simply took one link from the error messages in my console and copied it into my browser to get this. As you can see in the screenshot, we are challenged with a captcha; after clicking through it, I got a 403 error.

FYI, when I tried RSelenium last night, this captcha challenge also occurred once in a while. I guess this has to do with my IP address, but I can't change it (because of the institutional access I need).

Let me try to find some reproducible examples of this.

fangzhou-xie commented 4 years ago

I tried slowing the program down by sleeping between requests, and it seems I no longer face the human-machine challenge. However, I found something else:

```r
> link <- crm_links("10.1093/reseval/rvv030")
> ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
> crm_text(link, "pdf", overwrite_unspecified = T, useragent = ua)
Downloading pdf...
Extracting text from pdf...
PDF error: May not be a PDF file (continuing anyway)
PDF error (2): Illegal character <21> in hex string
PDF error (4): Illegal character <4f> in hex string
....
PDF error (197): Illegal character <74> in hex string
PDF error (198): Illegal character <6c> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
```

I believe this is connected to #40, where the issue is in the extraction step.

sckott commented 4 years ago

Right, I was going to say that captcha thing is rate limiting related, and probably best treated by sleeping in between requests, which you already did.

I get the same error - i'll look into that one

sckott commented 4 years ago

forgot to ping this issue, i think. but i made a change internally to error better - in this case the article is not accessible, and we can find a "not logged in" error in the resulting html, as below. we now try to find that error and delete the file before returning (a rough sketch of the idea follows the error output)

```r
xx = crm_text(url = link, type = "pdf", overwrite_unspecified = TRUE, useragent = ua)
Error: error in pdf retrieval; attempted to extract error messages:

            You could not be signed in. Please check your email address / username and password and try again.

            You could not be signed in. Please check your email address / username and password and try again
```

there are two instances of the error message in the html, unfortunately. we could do more massaging to make the message better, but at the risk of making the code fragile
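for illustration, a rough sketch of that kind of check - not the actual crminer internals, just the idea, assuming a failed download is an HTML error page rather than a real PDF:

```r
# Not the actual crminer internals -- just the idea: if the downloaded "PDF"
# is really an HTML error page, surface the sign-in message and delete the file.
check_pdf_download <- function(path) {
  # real PDFs start with the magic bytes "%PDF-"
  magic <- rawToChar(readBin(path, "raw", n = 5L))
  if (!identical(magic, "%PDF-")) {
    html <- paste(readLines(path, warn = FALSE), collapse = "\n")
    msg <- regmatches(html, regexpr("You could not be signed in[^<]*", html))
    unlink(path)  # don't leave a bogus .pdf on disk
    stop("error in pdf retrieval; attempted to extract error messages:\n  ",
         if (length(msg) > 0) msg else "unknown error", call. = FALSE)
  }
  invisible(path)
}
```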

fangzhou-xie commented 4 years ago

Thank you so much for your help!

fangzhou-xie commented 4 years ago

A quick update: it seems to me that Oxford uses your IP address to detect whether you are requesting too much, while Wiley rate-limits on the Crossref API key (60 articles per 6 minutes).

For Oxford, as long as we wait between calls, we should be fine. But for Wiley, the quota will not be replenished until the 6-minute cycle resets.

sckott commented 4 years ago

Thanks for the update. I should add something about Oxford to the docs. What sleep time are you using?

fangzhou-xie commented 4 years ago

Currently, I am using 3 seconds - roughly the loop sketched below - but I haven't carefully examined whether that is the best value.
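A minimal sketch of that pacing; the DOI vector is hypothetical, the user agent string is the one from earlier in the thread, and the 3-second pause is the value just mentioned:

```r
library(crminer)

dois <- c("10.1093/reseval/rvv030", "10.1093/icc/4.1.1-a")  # hypothetical list
ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"

texts <- list()
for (doi in dois) {
  link <- crm_links(doi)
  # capture errors so one bad article doesn't stop the whole run
  texts[[doi]] <- tryCatch(
    crm_text(link, "pdf", overwrite_unspecified = TRUE, useragent = ua),
    error = function(e) e
  )
  Sys.sleep(3)  # pause between requests to avoid the captcha / rate limit
}
```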

sckott commented 4 years ago

thanks, that's helpful