no need to apologize, more issues is always better.
this appears to be a case of simply needing to pretend to be a browser. they appear to be looking for a browser-like user agent string. try
```r
ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
z <- crm_links("10.1093/icc/4.1.1-a")
crm_text(z, "pdf", overwrite_unspecified = TRUE, verbose = TRUE, useragent = ua)
```
that ua string is the one i copied from my browser's devtools when requesting the url for the article - i don't think the exact versions are important, i imagine they are using some kind of regex
Thank you so much! However, I found that this solution works for a small number of articles but starts failing after a certain point. (As you may have noticed, I am fetching a lot of articles.)
I can't find which link I used to take this screenshot, but it certainly happened for many articles. I simply took one link from the error messages in my console and copied it into my browser to get this. As you can see in the screenshot, we are challenged by a captcha. After clicking to proceed, I got a 403 error.
FYI, when I tried RSelenium last night, this captcha challenge also occurred once in a while. I guess this has to do with my IP address, but I can't change it (because of the institutional access I need).
Let me try to find some reproducible examples of this.
I tried slowing the program down by sleeping between requests, and it seems I no longer hit the captcha challenge. However, I found something else:
```r
> link <- crm_links("10.1093/reseval/rvv030")
> ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
> crm_text(link, "pdf", overwrite_unspecified = T, useragent = ua)
Downloading pdf...
Extracting text from pdf...
PDF error: May not be a PDF file (continuing anyway)
PDF error (2): Illegal character <21> in hex string
PDF error (4): Illegal character <4f> in hex string
....
PDF error (197): Illegal character <74> in hex string
PDF error (198): Illegal character <6c> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
```
I believe this is connected to #40, where the issue is in the extraction step.
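For debugging on my end, here is a quick check of whether the downloaded file is really a PDF (a sketch; `path` is a hypothetical variable pointing at the cached download):

```r
# sketch: check whether the downloaded file is really a pdf before parsing;
# poppler's "May not be a PDF file" suggests we got an html page instead
# (`path` is a hypothetical variable pointing at the cached download)
is_pdf <- function(path) {
  header <- rawToChar(readBin(path, "raw", n = 5))
  identical(header, "%PDF-")  # every pdf starts with the magic bytes %PDF-
}
```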
Right, I was going to say that captcha thing is rate limiting related, and probably best treated by sleeping in between requests, which you already did.
I get the same error - i'll look into that one
forgot to ping this issue, i think. but i made a change internally to error better - in this case the article isn't accessible, and we can find a not-logged-in error in the resulting html, as below. we now try to find that error and delete the file before returning
```r
xx = crm_text(url = link, type = "pdf", overwrite_unspecified = TRUE, useragent = ua)
Error: error in pdf retrieval; attempted to extract error messages:
You could not be signed in. Please check your email address / username and password and try again.
You could not be signed in. Please check your email address / username and password and try again
```
there are two instances of the error message in the html, unfortunately. we could do more massaging to make the message better, but at the risk of making the code fragile
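the shape of the new check is roughly this (a simplified sketch, not the exact internal code; the xpath is a guess at the page structure):

```r
library(xml2)

# simplified sketch of the new behavior, not the exact internal code;
# the xpath is a guess at the oup error page structure
check_signin_error <- function(file) {
  html <- tryCatch(read_html(file), error = function(e) NULL)
  if (is.null(html)) return(invisible(file))
  msgs <- xml_text(xml_find_all(html, "//div[contains(@class, 'error')]"))
  msgs <- unique(trimws(msgs))  # the same message appears twice in the page
  if (length(msgs) > 0) {
    unlink(file)  # delete the bad download before erroring
    stop("error in pdf retrieval; attempted to extract error messages:\n",
         paste(msgs, collapse = "\n"), call. = FALSE)
  }
  invisible(file)
}
```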
Thank you so much for your help!
A quick update: it seems that Oxford uses the IP address to detect whether you are requesting too much, while Wiley rate-limits on the Crossref API key (60 articles per 6 minutes).
For Oxford, as long as we wait between calls, we should be fine. But for Wiley, the quota will not be replenished until the 6-minute cycle ends.
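For example, a throttled download loop could look like this (a sketch; `dois` is a hypothetical vector of DOIs, and 6 seconds between requests keeps Wiley under 60 articles per 6 minutes, since 360 s / 60 = 6 s):

```r
library(crminer)

ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
dois <- c("10.1093/reseval/rvv030")  # hypothetical vector of DOIs to fetch

texts <- list()
for (doi in dois) {
  link <- crm_links(doi)
  texts[[doi]] <- tryCatch(
    crm_text(link, "pdf", overwrite_unspecified = TRUE, useragent = ua),
    error = function(e) e  # keep the error so failed DOIs can be retried later
  )
  Sys.sleep(6)  # 360 s / 60 articles = 6 s per request stays under Wiley's quota
}
```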
Thanks for the update. I should add something about Oxford to the docs. What sleep time are you using?
Currently, I am using 3 seconds, but I haven't carefully examined whether that is optimal.
thanks, that's helpful
Session Info
```r
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3
```

Sorry for my repeated posting of issues. This time I am working on journals from Oxford University Press.
I can confirm that I can open this link in a browser, but calling the `crm_text()` function throws a timeout error. I tried `curl -o` in the terminal and got the same timeout error. I then ran an `RSelenium` browser and fetched that full-text link; it displayed the article (a PDF) properly in the automated `chromedriver`.
I think their server runs some JavaScript check that curl-based HTTP requests fail. (I am not very familiar with how this works in R, but I guess it is the same as with the Python `requests` package, which cannot handle dynamically rendered elements.) I believe the current workaround would be to use `RSelenium`, download the PDF, and then extract plain text from it, roughly as sketched below. I wonder if there are better methods to deal with this without using Selenium?
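A sketch of that `RSelenium` route (assuming chromedriver is installed; `pdf_url` is a hypothetical variable holding the full-text link from `crm_links()`):

```r
library(RSelenium)

# sketch of the RSelenium workaround; assumes chromedriver is installed
# and pdf_url holds the full-text link (hypothetical variable)
driver <- rsDriver(browser = "chrome", verbose = FALSE)
remote <- driver$client

remote$navigate(pdf_url)  # a real browser passes the JavaScript check
Sys.sleep(5)              # give the page time to render the pdf

# from here, save the pdf (e.g. via chrome's download directory) and
# extract text with pdftools::pdf_text() on the downloaded file

remote$close()
driver$server$stop()
```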