ropensci-archive / crminer

:no_entry: ARCHIVED :no_entry: Fetch 'Scholarly' Full Text from 'Crossref'

elsevier full text issues #37

Closed sckott closed 4 years ago

sckott commented 4 years ago

Use case from email.

User gave examples of DOIs for a journal they have access to; they can access the PDFs in the browser, but via API calls they cannot access the full text. The DOIs that are not accessible via the API all appear to be in the range 1993-2003. Here are 5 example DOIs for this scenario

The PDFs for these DOIs do exist, but as far as I can tell there's no way to figure out the URLs for those PDFs

hubgit commented 4 years ago

The DOIs have links in their Crossref metadata:

e.g. http://api.crossref.org/works/10.1006/jeth.1993.1066

so I imagine the PDF (for clients with access) would be at

https://api.elsevier.com/content/article/PII:S0022053183710665?httpAccept=application/pdf
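
For reference, you can inspect those link entries from R as well (presumably this is also what crm_links() reads under the hood):

library(jsonlite)

meta <- fromJSON("http://api.crossref.org/works/10.1006/jeth.1993.1066")
# each row is a full-text link with its advertised content type
meta$message$link[, c("URL", "content-type")]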

sckott commented 4 years ago

sorry to not include further details - correct, that metadata is available, but when you curl that URL (if you have access) you only get the abstract

hubgit commented 4 years ago

Ok, I guess it must be an issue with the API (permissions or something else), because the PDF is there for open access articles:

https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=application/pdf

sckott commented 4 years ago

Ah ha, should have caught that. You're right, Crossref doesn't give a URL for the PDF, but that doesn't mean it doesn't exist

sckott commented 4 years ago

thanks @hubgit for the help

x <- crm_links('10.1006/jeth.1993.1066')
out <- crm_text(x, "pdf")
out$text
#> [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

now to work out how to detect whether a PDF needs OCR or not

fangzhou-xie commented 4 years ago

I believe crm_text() uses pdftools under the hood to get the full text. I also notice that pdftools has another function that does OCR via tesseract (https://cran.r-project.org/web/packages/pdftools/pdftools.pdf). I wonder if it is possible, in this Elsevier case, to get the un-OCR'ed PDF via the API and then use pdf_ocr_text to extract the text from it?
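
Something like this is what I have in mind - pdf_ocr_text() renders each page and runs tesseract on it (the file path here is just a hypothetical downloaded PDF):

library(pdftools)

path <- "PII_S0022053183710665.pdf"  # hypothetical PDF downloaded from the Elsevier API
txt <- pdf_ocr_text(path)            # returns one character string per page
cat(substr(txt[1], 1, 200))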

This Elsevier problem is not unique to the Journal of Economic Theory. Several other Elsevier journals have the same issue, including:

Journal of Multivariate Analysis:

10.1006/jmva.1993.1001
10.1006/jmva.1993.1017
10.1006/jmva.1997.1714

Journal of Urban Economics:

10.1006/juec.1993.1001
10.1006/juec.1993.1026

and possibly many others as well.

sckott commented 4 years ago

thanks @mark-fangzhou-xie for the further input.

yes, I'm aware of the OCR ability in pdftools. I'm not sure it'd be a good idea to run pdftools::pdf_ocr_text automatically for the user within crm_text(), as at least in some example PDFs it takes quite a while to run - a lot longer than the pdftools::pdf_text that's used right now. I'd lean more towards somehow marking those PDFs that would need OCR - and telling the user to run pdftools::pdf_ocr_text as a separate step, probably.

however, before doing that, it'd be nice to have a robust way of detecting if a PDF has embedded text, or if it's from scanning. Anyone know how to do that? @hubgit ?

fangzhou-xie commented 4 years ago

Thank you for your reply!

i'm not sure it'd be a good idea to run pdftools::pdf_ocr_text automatically for the user within crm_text() as at least in some example pdfs it takes quite a while to run

I understand that the OCR process is computationally expensive, but I wonder if it would be possible to pass an argument to crm_text() that calls the OCR functionality in pdftools::pdf_ocr_text? For example, force.ocr = TRUE to OCR the downloaded PDF. This option could default to FALSE so it is not run automatically, and would only be used by those who really want it (me, for example). For our application, we would be very eager to have that.

Thank you very much!

hubgit commented 4 years ago

however, before doing that, it'd be nice to have a robust way of detecting if a PDF has embedded text, or if it's from scanning. Anyone know how to do that? @hubgit ?

You could certainly run something to extract the text from the PDF without OCR first, and only run the OCR if there's no content.
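
Something along these lines, for example (extract_with_fallback() is just a made-up helper, not part of crminer):

library(pdftools)

extract_with_fallback <- function(path) {
  txt <- pdf_text(path)
  # if every page comes back empty/whitespace, assume a scanned PDF and OCR it
  if (all(!nzchar(trimws(txt)))) {
    txt <- pdf_ocr_text(path)
  }
  txt
}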

sckott commented 4 years ago

thanks @hubgit - that was my initial thought, but I wondered if there was some smarter method I didn't know about. I'll do that then

sckott commented 4 years ago

@mark-fangzhou-xie that's probably a good compromise to have a parameter - will have a go soon

fangzhou-xie commented 4 years ago

@mark-fangzhou-xie that's probably a good compromise to have a parameter - will have a go soon

Thank you very much and I am looking forward to it! Could you please kindly leave a comment here under this thread when it's done? No pressure but I would like to check out this feature as soon as possible.

sckott commented 4 years ago

@mark-fangzhou-xie reinstall, e.g.,

doi <- '10.1006/jeth.1993.1066'
z <- crm_links(doi)
crm_text(z, "pdf", try_ocr = TRUE)

if try_ocr=TRUE we try to extract regular text first, and if that fails, then we try ocr extraction

note that tesseract prints progress like

Converting page 1 to PII:S0094119083710016_1.png... done!
Converting page 2 to PII:S0094119083710016_2.png... done!
...

which I can't figure out how to suppress, so we're stuck with that for now

we should probably add ability to cache the results of extracting OCRed text since that step takes so long to run

fangzhou-xie commented 4 years ago

Thank you soooo much! You are really a life-saver!

I have tried the example and it worked very well. It did take a while, but I think that is just the nature of the OCR process (it is computationally expensive).

I agree with the cache idea. I wonder if it is worth creating an SQLite database somewhere under the package cache folder (using RSQLite, for example) and saving the OCR'ed text in the database for later retrieval?
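
Roughly what I have in mind (the database path, table name, and columns here are all made up for illustration):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "ocr_cache.sqlite")
dbExecute(con, "CREATE TABLE IF NOT EXISTS ocr_text (doi TEXT PRIMARY KEY, text TEXT)")

# hypothetical OCR'ed text for one scanned article
ocr_txt <- pdftools::pdf_ocr_text("some_scanned_article.pdf")

# store the OCR'ed text keyed by DOI
dbExecute(con, "INSERT OR REPLACE INTO ocr_text VALUES (?, ?)",
          params = list("10.1006/jeth.1993.1066", paste(ocr_txt, collapse = "\n")))

# retrieve it later instead of re-running the OCR
dbGetQuery(con, "SELECT text FROM ocr_text WHERE doi = ?",
           params = list("10.1006/jeth.1993.1066"))

dbDisconnect(con)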

sckott commented 4 years ago

So we do use caching now if you use cache=TRUE, but we just cache files to disk, without any database. I avoided a database b/c I thought users may want to access the PDFs/XMLs/etc. outside of this pkg, even in other programming languages or other workflows. So I thought plain files on disk provided the lowest barrier to reuse downstream.

Do you care how data is cached? do you use files/data outside of using this package?

fangzhou-xie commented 4 years ago

I can see that PDFs are cached under ~/Library/Caches/R/crminer (on my macOS Catalina), and I completely agree with you that users could later use the plain files in other workflows. What I proposed followed your previous comment on storing the OCR'ed results from scanned PDFs, where text cannot be extracted directly from the PDF; storing those texts somewhere could potentially save some time later on. Using a database may be somewhat restrictive, but from my experience I don't really know what other options would be useful here.

For my application, I only care about the plain text of the articles and save it as .txt files anyway. Before you implemented the OCR option, I would have to take the downloaded PDF (for those problematic Elsevier articles) from the cache folder and call the OCR functionality myself to get the full text. Thanks to you, I can now get the full text directly by calling crm_text() in one go.

After this full-text-collection process, we will probably move on to Python for further analysis based on the full texts. But again, all we care about so far is really the plain text.

sckott commented 4 years ago

@mark-fangzhou-xie Do you care how txt files are formatted? Do you just cat() the text to a file? or do you put separate pages in different files?

fangzhou-xie commented 4 years ago

library(crminer)
library(magrittr)  # for the %>% pipe

doi <- "...."
link <- crm_links(doi)
fulltext <- crm_text(link, type = "pdf", overwrite_unspecified = TRUE)
fulltext <- fulltext$text
# or
fulltext <- crm_text(link, type = "plain")
# and save it somewhere
fileConn <- file(savepath)  # savepath is the destination .txt file
writeLines(fulltext %>% as.character(), fileConn)
close(fileConn)

By doing so, each individual article (paper) is saved to its own .txt file. I am using papers as the atomic element. And no, I don't put different pages into different files.
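
In batch form, my workflow is essentially this (the DOI vector and output directory are just placeholders):

library(crminer)

dois <- c("10.1006/jeth.1993.1066", "10.1006/jmva.1993.1001")  # placeholder DOIs
outdir <- "fulltext_txt"
dir.create(outdir, showWarnings = FALSE)

for (doi in dois) {
  link <- crm_links(doi)
  txt <- crm_text(link, type = "pdf", try_ocr = TRUE)$text
  # one .txt file per paper; replace "/" so the DOI is a valid file name
  writeLines(as.character(txt), file.path(outdir, paste0(gsub("/", "_", doi), ".txt")))
}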

sckott commented 4 years ago

thanks. I think saving the text of each paper to its own file makes the most sense - and makes it easy to work into other workflows, like within Python or command line tools. SQLite is still a good idea; may consider that if plain files run into issues

fangzhou-xie commented 4 years ago

I only have one concern about using SQLite though. I currently have 50+ GB of cached PDFs under the "~/Library/Caches/R/crminer" folder. The SQLite database could likewise grow very large and thus affect its read/write performance. If someone collects lots of PDFs using crminer, they may face performance issues once they have accumulated a large cache.sqlite file.

But I guess any other method of caching things would have a similar issue. And I doubt anyone is really collecting that much information.

Thank you!

sckott commented 4 years ago

thanks for that info. Good point about performance if an SQLite file gets big. I'm not familiar with SQLite issues at a large scale; we'll look into it if we go that route.

for now i think i'll pursue the text file route

fangzhou-xie commented 4 years ago

Thank you for your effort on this package (and other packages as well). These are indeed very useful tools for people who wish to study scholarly texts. I hope that my use case can be helpful in testing the package at a larger scale.

fangzhou-xie commented 4 years ago

Yet there is another issue with Elsevier that I have found.

> link <- crm_links("10.1016/j.intacc.2003.09.001")
> link
$xml
<url> https://api.elsevier.com/content/article/PII:S0020706303000694?httpAccept=text/xml

$plain
<url> https://api.elsevier.com/content/article/PII:S0020706303000694?httpAccept=text/plain
> crm_text(link, "plain", verbose=T)
* Found bundle for host api.elsevier.com: 0x7f861e846bd0 [can pipeline]
* Could pipeline, but not asked to!
* Re-using existing connection! (#1) with host api.elsevier.com
* Connected to api.elsevier.com (34.204.27.83) port 443 (#1)
> GET /content/article/PII:S0020706303000694 HTTP/1.1
Host: api.elsevier.com
User-Agent: libcurl/7.64.1 r-curl/4.3 crul/0.9.0
Accept-Encoding: gzip, deflate
Accept: application/json, text/xml, application/xml, */*

< HTTP/1.1 200 OK
< allow: GET
< Content-Encoding: gzip
< Content-Type: application/json;charset=UTF-8
< Date: Wed, 06 May 2020 15:00:24 GMT
< Last-Modified: Thu, 14 May 2015 09:49:50 GMT
< Server: Apache-Coyote/1.1
< vary: Origin
< WARNING: Unauthorized request results in minimized metadata response.
< X-ELS-APIKey: 7968ea68ad28c4627d768d46292800fe
< X-ELS-ReqId: 25599653-a3d5-4e93-bf10-26801796ad1a
< X-ELS-ResourceVersion: default
< X-ELS-Status: WARNING - Unauthorized request results in minimized metadata response.
< X-ELS-TransId: 97b76663-3247-4bcd-94fe-97c2cb968992
< Content-Length: 520
< Connection: keep-alive
< 
* Connection #1 to host api.elsevier.com left intact

I notice there is a warning saying WARNING: Unauthorized request results in minimized metadata response. This should not have happened, as I definitely have access to this journal. The result turns out to be JSON-like metadata instead of the full text.

I wonder if this one is connected to the Elsevier first-page problem #43, or if there is something special about this journal?

hubgit commented 4 years ago

A JSON response seems reasonable for a request that is essentially curl 'https://api.elsevier.com/content/article/PII:S0020706303000694' -H 'Accept: application/json'.

I don't know if something changed, but neither text/plain nor application/pdf seem to be acceptable any more:

curl 'https://api.elsevier.com/content/article/PII:S0370269310012608' -H 'Accept: text/plain'
<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>View parameter specified in request is not valid</statusText></status></service-error>
curl 'https://api.elsevier.com/content/article/PII:S0370269310012608' -H 'Accept: application/pdf' 
<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>Accept header value 'application/pdf' is restricted</statusText></status></service-error>

sckott commented 4 years ago

it's a result of trying to deal with lots of different publishers, all with different URL patterns, and in this case also DOIs that have been transferred between publishers - so that's fun

fangzhou-xie commented 4 years ago

Sometimes, journals will be transferred to a different publisher. Older articles are hosted on the old publisher's website, while newer articles are hosted on the new publisher's website.

Example: Review of Financial Economics

> crm_links("10.1002/rfe.1102")
$pdf
<url> https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Frfe.1102

$pdf
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1002/rfe.1102

$xml
<url> https://onlinelibrary.wiley.com/doi/full-xml/10.1002/rfe.1102

$unspecified
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1002/rfe.1102

> crm_links("10.1016/j.rfe.2016.09.003")
$plain
<url> http://api.elsevier.com/content/article/PII:S1058330015301129?httpAccept=text/plain

$xml
<url> http://api.elsevier.com/content/article/PII:S1058330015301129?httpAccept=text/xml

$unspecified
<url> https://onlinelibrary.wiley.com/doi/full/10.1016/j.rfe.2016.09.003

It used to be an Elsevier journal, but now a Wiley one.

sckott commented 4 years ago

That example should work now.

sckott commented 4 years ago

@mark-fangzhou-xie Added caching for texts extracted from PDFs - both for normal PDFs and for OCR'ed PDFs. Also removed the cache parameter - it didn't seem useful - so all requests will use cached files if they exist.
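
e.g., just to illustrate - the second call should return almost instantly since the extracted (possibly OCR'ed) text is reused:

z <- crm_links('10.1006/jeth.1993.1066')
system.time(crm_text(z, "pdf", try_ocr = TRUE))  # extracts / OCRs and caches
system.time(crm_text(z, "pdf", try_ocr = TRUE))  # should be served from the cache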

fangzhou-xie commented 4 years ago

Thank you! This feature is really cool.

I just found another example, and this time the PDF parsing failure happened for an Elsevier article.

> link <- crm_links("10.1016/s0048-7333(00)00087-1")
> crm_text(link, "pdf")
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.

sckott commented 4 years ago

@mark-fangzhou-xie I can't replicate that issue either on or off the VPN. Are you using the latest version?

fangzhou-xie commented 4 years ago

Sorry for my late reply. I am using crminer version 0.3.4.99.

fangzhou-xie commented 4 years ago

Thank you! I found that my previous example works now. As I was looking at the content of the documents I downloaded recently, I found the following example where the links for plain and pdf are the same, and neither of them returns the full text correctly. If I choose pdf, the link appears to be one that should be used as plain, and calling crm_text or crm_pdf will raise an error; however, if I choose plain then the result is a mixture of some text and many URL links.

> library(crminer)
> l <- crm_links("10.1016/j.jet.2015.08.007")
> l
$plain
<url> https://api.elsevier.com/content/article/PII:S0022053115001660?httpAccept=text/plain

$xml
<url> https://api.elsevier.com/content/article/PII:S0022053115001660?httpAccept=text/xml

$pdf
<url> https://api.elsevier.com/content/article/PII:S0022053115001660?httpAccept=text/plain

> crm_pdf(l)
Downloading pdf...
[1] "/Users/xiefangzhou/Library/Caches/R/crminer/PII:S0022053115001660.pdf"

> crm_plain(l)
[1] ....
Interbank market Monetary policy implementation Unconventional monetary policy    https://api.elsevier.com/content/object/eid/1-s2.0-S0022053115001660-gr001.sml?httpAccept=%2A%2F%2A 
.....

In any case, I think this might be a problem with Elsevier and not with crminer, but I just wonder if you know of any workaround for this? Can we extract the full text from the xml as well? Thanks a lot!
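
On the xml route, I imagine something like this could work, assuming crm_text(l, "xml") returns a parsed xml2 document (the blunt xml_text() flattening here is just a guess; a real workflow would probably target the body element of Elsevier's schema instead):

library(crminer)
library(xml2)

l <- crm_links("10.1016/j.jet.2015.08.007")
doc <- crm_text(l, "xml")   # assumed to return a parsed xml2 document
full_text <- xml_text(doc)  # crude: flattens every text node in the document
substr(full_text, 1, 500)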

fangzhou-xie commented 4 years ago

After careful examination, I found that it actually provides the full text for us, just buried among many nonsense links and acknowledgment notes.

Example:

```
.... https://s3-eu-west-1.amazonaws.com/prod-ucs-content-store-eu-west/content/pii:S0022053115001660/STRIPIN/image/gif/704c275b8f745e8d4c87d2ec951a10b9/si98.gif https://s3.amazonaws.com/prod-ucs-content-store-us-east/content/pii:S0022053115001660/STRIPIN/image/gif/704c275b8f745e8d4c87d2ec951a10b9/si98.gif si98 si98.gif gif 136 11 10 ALTIMG 1-s2.0-S0022053115001660-si99.gif https://s3-eu-west-1.amazonaws.com/prod-ucs-content-store-eu-west/content/pii:S0022053115001660/STRIPIN/image/gif/18157c3c193dc8e06b7ae886a4822753/si99.gif https://s3.amazonaws.com/prod-ucs-content-store-us-east/content/pii:S0022053115001660/STRIPIN/image/gif/18157c3c193dc8e06b7ae886a4822753/si99.gif si99 si99.gif gif 1105 36 322 ALTIMG YJETH 4472 S0022-0531(15)00166-0 10.1016/j.jet.2015.08.007 Elsevier Inc. Fig. 1 Excess reserves, ECB policy rates and Eonia. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.) Fig. 2 EONIA statistics vs. excess reserves. Fig. 3 Time line – intraday reserve holdings. Fig. 4 Willingness to pay for reserves, W′(X). (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.) Fig. 5 Matches, trading volume, and average traded size. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.) Fig. 6 The effective interbank rate. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.) Fig. 7 Interbank rate dispersion. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.) Fig. 8 End-of-day borrowing from the central bank. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.) Fig. 9 Efficacy measures. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.) ☆ We thank Jonathan Chiu, Marie Hoerova, Jens Tapking, and audiences at the 2013 ECB Workshop on “Structural changes in money markets: Implications for monetary policy implementation”, the joint CFS-ECB Lunchtime seminar, the 2014 Search and Matching workshop in Singapore, the Bank of Canada and the LAEF workshop on “Money as a Medium of Exchange: KW at 25!” for useful comments. We are especially thankful to the editor Ricardo Lagos and two anonymous referees whose comments greatly improved the paper. C. Monnet gratefully acknowledges support of SNF grant 100018_152911. Part of this project was completed when Monnet was a research fellow at the BIS and he thanks the BIS for its hospitality and support. A search-based model of the interbank money market and monetary policy implementation Morten Bech a morten.bech@bis.org Cyril Monnet b ⁎ cyril.monnet@gmail.com a Bank for International Settlements, Switzerland Bank for International Settlements Switzerland b University of Bern, Study Center Gerzensee, Switzerland University of Bern Study Center Gerzensee Switzerland ⁎ Corresponding author. Abstract We present a search-based model of the interbank money market and monetary policy implementation. Banks are subject to reserve requirements and the central bank tenders reserves. Interbank payments redistribute holdings and banks trade with each other in a decentralized (over-the-counter) market. The central bank provides standing facilities where banks can either deposit surpluses or borrow to cover shortfalls of reserves overnight. The model provides insights on liquidity, trading volume, and rate dispersion in the interbank market – features largely absent from the canonical models in the tradition of Poole (1968) – and fits a number of stylized facts for the Eurosystem observed during the recent period of unconventional monetary policies. Moreover, it provides insights on the implications of different market structures. JEL classification G21 E5 Keywords Interbank market Monetary policy implementation Unconventional monetary policy 1 Introduction We present a model of the interbank money market and monetary policy implementation in a corridor system like that used by the European Central Bank (ECB). ....
```

Clearly, it begins with lots of links and then figure captions (that's why I thought we didn't have the full text). But immediately after the captions come the thank-you note, and then the title, authors, affiliations, abstract, and main text thereafter.

Sorry for the confusion and never mind about my previous post. Thank you!

sckott commented 4 years ago

Glad it works!

sckott commented 4 years ago

closing this issue, if there are any additional issues please open new issues for each one