Closed by sckott 4 years ago
The DOIs have links in their Crossref metadata:
e.g. http://api.crossref.org/works/10.1006/jeth.1993.1066
so I imagine the PDF (for clients with access) would be at
https://api.elsevier.com/content/article/PII:S0022053183710665?httpAccept=application/pdf
sorry to not include further details. - Correct, that metadata is available, but when you curl that URL (if you have access) you only get the abstract.
Ok, I guess it must be an issue with the API (permissions or something else), because the PDF is there for open access articles:
https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=application/pdf
Ah ha, should have caught that. You're right, Crossref doesn't give a URL for the PDF, but that doesn't mean it doesn't exist.
thanks @hubgit for the help
x <- crm_links('10.1006/jeth.1993.1066')
out <- crm_text(x, "pdf")
out$text
#> [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
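A quick check on the result above confirms the symptom: every page extracts to an empty string, i.e. the PDF likely has no embedded text layer (the `pages` vector below just mirrors the output shown):

```r
# mirrors out$text above: 17 pages, all empty strings
pages <- rep("", 17)
all(!nzchar(trimws(pages)))  # TRUE: likely a scanned, image-only PDF
```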
now to work out how to detect whether a pdf has an embedded text layer or is just a scan
I believe crm_text() will use pdftools under the hood to get the full text. I also notice that pdftools has another function that does OCR via tesseract (https://cran.r-project.org/web/packages/pdftools/pdftools.pdf). I wonder if it is possible, in this Elsevier case, to get the un-OCR'ed PDF via the API and then use pdf_ocr_text() to get the text from there?
This Elsevier problem is not unique to the Journal of Economic Theory. Several other Elsevier journals have the same issue, including:
Journal of Multivariate Analysis:
10.1006/jmva.1993.1001
10.1006/jmva.1993.1017
10.1006/jmva.1997.1714
Journal of Urban Economics:
10.1006/juec.1993.1001
10.1006/juec.1993.1026
and possibly many others as well.
thanks @mark-fangzhou-xie for the further input.
yes, i'm aware of the OCR ability in pdftools. i'm not sure it'd be a good idea to run pdftools::pdf_ocr_text automatically for the user within crm_text(), as at least in some example pdfs it takes quite a while to run - a lot longer than the pdftools::pdf_text that's used right now. I'd lean more towards marking those pdfs that need OCR somehow - and telling the user to run pdftools::pdf_ocr_text as a separate step.
however, before doing that, it'd be nice to have a robust way of detecting whether a PDF has embedded text or comes from scanning. Anyone know how to do that? @hubgit ?
Thank you for your reply!
i'm not sure it'd be a good idea to run pdftools::pdf_ocr_text automatically for the user within crm_text() as at least in some example pdfs it takes quite a while to run
I understand that the OCR process is computationally expensive, but I wonder if it is possible to pass an argument to crm_text() that calls the OCR functionality from pdftools::pdf_ocr_text? For example, force.ocr = TRUE to OCR the downloaded PDF. This option could default to FALSE so it is not run automatically, but only for those who really want it (me, for example). In our application, we would be very eager to have that.
Thank you very much!
however, before doing that, it'd be nice to have a robust way of detecting if a PDF has embedded text, or if it's from scanning. Anyone know how to do that? @hubgit ?
You could certainly run something to extract the text from the PDF without OCR first, and only run the OCR if there's no content.
thanks @hubgit - that was my initial thought, but I wondered if there some smarter method I didn't know about. i'll do that then
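The extract-first, OCR-on-empty approach can be sketched as a small wrapper (a sketch only; extract_text is a hypothetical helper name, and it assumes pdftools is installed and path points to a local PDF):

```r
# Try plain extraction first; fall back to OCR only when no page has text.
extract_text <- function(path) {
  pages <- pdftools::pdf_text(path)
  if (all(!nzchar(trimws(pages)))) {
    pdftools::pdf_ocr_text(path)  # slow: renders each page and runs tesseract
  } else {
    pages
  }
}
```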
@mark-fangzhou-xie that's probably a good compromise to have a parameter - will have a go soon
Thank you very much and I am looking forward to it! Could you please kindly leave a comment here under this thread when it's done? No pressure but I would like to check out this feature as soon as possible.
@mark-fangzhou-xie reinstall, e.g.,
doi <- '10.1006/jeth.1993.1066'
z <- crm_links(doi)
crm_text(z, "pdf", try_ocr = TRUE)
if try_ocr=TRUE we try to extract regular text first, and if that fails, then we try OCR extraction
note that tesseract prints progress like
Converting page 1 to PII:S0094119083710016_1.png... done!
Converting page 2 to PII:S0094119083710016_2.png... done!
...
which I can't figure out how to suppress, so we're stuck with that for now
we should probably add ability to cache the results of extracting OCRed text since that step takes so long to run
Thank you soooo much! You are really a life-saver!
I have tried the example and it worked very well. It for sure took a while but I think that is the nature of the OCR process (to be computationally expensive).
I agree with the cache idea. I wonder if it is worth creating an SQLite database somewhere under the package folder (using RSQLite, for example) and saving the OCR'ed text in the database for later retrieval?
So we do use caching now if you use cache=TRUE, but we just cache files to disk, without any database. I avoided a database b/c I thought users may want to access pdfs/xmls/etc. outside of this pkg, even in other programming languages or other workflows. So plain files on disk, I thought, provide the lowest barrier to downstream reuse.
Do you care how data is cached? do you use files/data outside of using this package?
I can see that PDFs are cached under ~/Library/Caches/R/crminer (on my macOS Catalina), and I completely agree with you that users could later use the plain files in other workflows. What I proposed was following your previous comment on storing OCR'ed results from scanned PDFs, where text cannot be directly extracted from the plain PDF; storing that text somewhere could save some time later on. Using a database may be somewhat restrictive, but from my experience, I don't really know what other options would be useful here.
For my application, I only care about the plain text of the articles and save it as .txt files anyway. Before you implemented the OCR option, I would have to take the downloaded PDF (for those problematic Elsevier articles) from the cache folder and call the OCR functionality myself to get the full text. Thanks to you, I can now get full text directly by calling the crm_text() function in one go.
After this full-text-collection process, we will probably move on to Python for further analysis based on the full texts. But again, all we care about so far is really the plain text.
@mark-fangzhou-xie Do you care how txt files are formatted? Do you just cat() the text to a file? or do you put separate pages in different files?
doi <- "...."
link <- crm_links(doi)
fulltext <- crm_text(link, type = "pdf", overwrite_unspecified = TRUE)
fulltext <- fulltext$text
# or
fulltext <- crm_text(link, type = "plain")
# and save it somewhere
fileConn <- file(savepath)
writeLines(as.character(fulltext), fileConn)
close(fileConn)
By doing so, each individual article (paper) is saved in its own .txt file. I am using papers as the atomic element. And no, I don't put different pages into different files.
thanks. i think saving the text of each paper to a file makes the most sense - and makes it easy to work into other workflows, like within Python or command line tools. SQLite is still a good idea; may consider that if the file approach runs into issues
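For reference, a minimal sketch of what an RSQLite-backed text cache could look like (illustrative only; the table name and the cache_put/cache_get helpers are made up, and the package itself caches plain files):

```r
library(DBI)

# one table keyed by DOI, holding the extracted full text
con <- dbConnect(RSQLite::SQLite(), "cache.sqlite")
dbExecute(con, "CREATE TABLE IF NOT EXISTS texts (doi TEXT PRIMARY KEY, text TEXT)")

cache_put <- function(doi, text) {
  dbExecute(con, "INSERT OR REPLACE INTO texts VALUES (?, ?)",
            params = list(doi, text))
}
cache_get <- function(doi) {
  res <- dbGetQuery(con, "SELECT text FROM texts WHERE doi = ?",
                    params = list(doi))
  if (nrow(res) > 0) res$text else NULL
}
```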
I only have one concern about using SQLite though. I currently have 50+ GB of cached PDFs under "~/Library/Caches/R/crminer". An SQLite database may likewise grow very large and thus hurt its read/write performance. If someone collects lots of PDFs using crminer, they may face performance issues once they have a large cache.sqlite file.
But I guess any other caching method will have a similar issue. And I doubt anyone is really collecting that much data.
Thank you!
thanks for that info. Good point about performance if a sqlite file gets big. I'm not familiar with sqlite issues at large scale; we'll look into it if we go that route.
for now i think i'll pursue the text file route
Thank you for your effort in this package (and other packages as well). These are indeed very useful tools for people who wish to study scholarly texts. I hope that my use case could be helpful in testing the usage of this package on a large scale.
Yet there is another issue with Elsevier that I have found.
> link <- crm_links("10.1016/j.intacc.2003.09.001")
> link
$xml
<url> https://api.elsevier.com/content/article/PII:S0020706303000694?httpAccept=text/xml
$plain
<url> https://api.elsevier.com/content/article/PII:S0020706303000694?httpAccept=text/plain
> crm_text(link, "plain", verbose=T)
* Found bundle for host api.elsevier.com: 0x7f861e846bd0 [can pipeline]
* Could pipeline, but not asked to!
* Re-using existing connection! (#1) with host api.elsevier.com
* Connected to api.elsevier.com (34.204.27.83) port 443 (#1)
> GET /content/article/PII:S0020706303000694 HTTP/1.1
Host: api.elsevier.com
User-Agent: libcurl/7.64.1 r-curl/4.3 crul/0.9.0
Accept-Encoding: gzip, deflate
Accept: application/json, text/xml, application/xml, */*
< HTTP/1.1 200 OK
< allow: GET
< Content-Encoding: gzip
< Content-Type: application/json;charset=UTF-8
< Date: Wed, 06 May 2020 15:00:24 GMT
< Last-Modified: Thu, 14 May 2015 09:49:50 GMT
< Server: Apache-Coyote/1.1
< vary: Origin
< WARNING: Unauthorized request results in minimized metadata response.
< X-ELS-APIKey: 7968ea68ad28c4627d768d46292800fe
< X-ELS-ReqId: 25599653-a3d5-4e93-bf10-26801796ad1a
< X-ELS-ResourceVersion: default
< X-ELS-Status: WARNING - Unauthorized request results in minimized metadata response.
< X-ELS-TransId: 97b76663-3247-4bcd-94fe-97c2cb968992
< Content-Length: 520
< Connection: keep-alive
<
* Connection #1 to host api.elsevier.com left intact
I notice there is a warning saying WARNING: Unauthorized request results in minimized metadata response.
This should not have happened as I certainly have access to this journal. The result turns out to be JSON-like metadata instead of the full text.
I wonder if this one is connected to the Elsevier first-page problem #43 or there is something special for this journal?
A JSON response seems reasonable for a request that is essentially curl 'https://api.elsevier.com/content/article/PII:S0020706303000694' -H 'Accept: application/json'.
I don't know if something changed, but neither text/plain nor application/pdf seems to be acceptable any more:
curl 'https://api.elsevier.com/content/article/PII:S0370269310012608' -H 'Accept: text/plain'
<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>View parameter specified in request is not valid</statusText></status></service-error>
curl 'https://api.elsevier.com/content/article/PII:S0370269310012608' -H 'Accept: application/pdf'
<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>Accept header value 'application/pdf' is restricted</statusText></status></service-error>
it's a result of trying to deal with lots of different publishers, all with different URL patterns, and in this case, also DOIs that have been transferred between publishers - so that's fun
Sometimes journals are transferred to a different publisher: previous articles remain hosted on the old publisher's website, while newer articles are hosted on the new publisher's website.
Example: Review of Financial Economics
> crm_links("10.1002/rfe.1102")
$pdf
<url> https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Frfe.1102
$pdf
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1002/rfe.1102
$xml
<url> https://onlinelibrary.wiley.com/doi/full-xml/10.1002/rfe.1102
$unspecified
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1002/rfe.1102
> crm_links("10.1016/j.rfe.2016.09.003")
$plain
<url> http://api.elsevier.com/content/article/PII:S1058330015301129?httpAccept=text/plain
$xml
<url> http://api.elsevier.com/content/article/PII:S1058330015301129?httpAccept=text/xml
$unspecified
<url> https://onlinelibrary.wiley.com/doi/full/10.1016/j.rfe.2016.09.003
It used to be an Elsevier journal, but now a Wiley one.
That example should work now.
@mark-fangzhou-xie Added caching for texts extracted from PDFs - both for normal PDFs and for OCR'ed PDFs. Also removed the cache parameter; it seemed not useful, so all requests will use cached files if they exist.
Thank you! This feature is really cool.
I just found another example, and this time the PDF parsing failure happened with Elsevier.
> link <- crm_links("10.1016/s0048-7333(00)00087-1")
> crm_text(link, "pdf")
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
@mark-fangzhou-xie i can't replicate that issue either on or not on the VPN. Are you using the latest version ?
Sorry for my late reply. I am using version crminer_0.3.4.99.
Thank you! I found that my previous example works now. As I was looking at the content of the documents I downloaded recently, I found the following example where the links for plain and pdf are the same, and neither of them returns the full text correctly. If I choose pdf, the link appears to be one that should be used as plain, and calling crm_text or crm_pdf raises an error; however, if I choose plain, the result is a mixture of some text and many URL links.
> library(crminer)
> l <- crm_links("10.1016/j.jet.2015.08.007")
> l
$plain
<url> https://api.elsevier.com/content/article/PII:S0022053115001660?httpAccept=text/plain
$xml
<url> https://api.elsevier.com/content/article/PII:S0022053115001660?httpAccept=text/xml
$pdf
<url> https://api.elsevier.com/content/article/PII:S0022053115001660?httpAccept=text/plain
> crm_pdf(l)
Downloading pdf...
[1] "/Users/xiefangzhou/Library/Caches/R/crminer/PII:S0022053115001660.pdf"
> crm_plain(l)
[1] ....
Interbank market Monetary policy implementation Unconventional monetary policy https://api.elsevier.com/content/object/eid/1-s2.0-S0022053115001660-gr001.sml?httpAccept=%2A%2F%2A
.....
In any case, I think this might be a problem with Elsevier and not with crminer, but I just wonder if you know any workaround for this? Can we extract the full text from the xml as well?
Thanks a lot!
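On extracting text from XML in general: once the XML is in hand, xml2 can strip the markup. A toy illustration (the inline snippet and the para element name are stand-ins; a real Elsevier response from the $xml link above would need an authenticated download and uses its own element names):

```r
library(xml2)

# toy document standing in for a full-text XML response
doc <- read_xml("<article><title>T</title><para>Full text here.</para></article>")
xml_text(xml_find_all(doc, "//para"))  # "Full text here."
```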
After careful examination, I found that it actually does provide the full text, just buried among many nonsense links and acknowledgment notes.
Clearly, it begins with lots of links and then figure captions (that's why I thought we didn't have the full text). But immediately after the captions comes the thank-you note, then title, author, affiliation, abstract, and the main text thereafter.
Sorry for the confusion and never mind about my previous post. Thank you!
Glad it works!
closing this issue, if there are any additional issues please open new issues for each one
Use case from email.
User gave examples of DOIs for a journal they have access to: they can access the PDFs in the browser, but via API calls they cannot access the full text. The non-accessible-via-API DOIs appear to all be in the range 1993-2003. here's 5 example DOIs for this scenario
The PDFs for these DOIs do exist, but as far as I can tell there's no way to figure out the URLs for those PDFs