Open MeyerGrace opened 6 years ago
Having the same issue in 2022.
My intuition is that those PDFs can't be read because of the printing mechanism that created the files. In my case the text in my PDF is not selectable, and from this url I learned that
(when printing) if the application doesn’t send “real” text, but instead an image of text, PDFCreator can’t convert this back into “real” text.
which means the content of my PDF is an image rather than text. I used pdf_ocr_text() to read the text instead of pdf_text().
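A minimal sketch of that workaround, assuming the goal is to keep pdf_text() for pages that have a real text layer and only run OCR on pages that come back blank. The helper name read_pdf_text and its text_fun/ocr_fun arguments are hypothetical and not part of pdftools; they only make the two extractors swappable.

```r
# Hypothetical helper (not part of pdftools): try the embedded text layer
# first, then OCR only the pages that came back empty.
read_pdf_text <- function(path,
                          text_fun = pdftools::pdf_text,
                          ocr_fun = pdftools::pdf_ocr_text) {
  # pdf_text() returns one string per page
  pages <- text_fun(path)

  # A page counts as empty if it holds nothing but whitespace
  empty <- !nzchar(trimws(pages))

  if (any(empty)) {
    # pdf_ocr_text() renders the requested pages to images and runs OCR on them
    pages[empty] <- ocr_fun(path, pages = which(empty))
  }
  pages
}
```

Calling read_pdf_text("file.pdf") with the defaults needs the tesseract package installed, since pdf_ocr_text() depends on it. OCR is much slower than pdf_text() and its output can contain recognition errors, so it is worth limiting it to the pages that need it.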
Hi, I'm trying to read ~1700 pdfs from urls and most are working but ~150 are not. For example this one: pdftools::pdf_text("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Years-BTP-Auction-Results-30.12.2002.pdf") gives an empty string.
This is reported as a bug here: https://github.com/ropensci/pdftools/issues/24 but installing the dev version didn't fix the issue for me.
However, this PDF's metadata is quite different from that of the ones that read properly: it is not linearised, has many "\n" in its metadata, and has a one-column layout. pdftools::pdf_info("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Years-BTP-Auction-Results-30.12.2002.pdf")
Is this a new bug or the same bug, or is this known functionality because this pdf is bad?
Thanks for any help; this package has already saved so many hours of work!
Cheers, Grace
An example that reads as expected: pdftools::pdf_text("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Year-BTP--Auction-Results-27.02.2.pdf")
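For a batch of this size, one way to list the files that come back empty is sketched below. The helper name find_empty_pdfs and its text_fun argument are hypothetical, not part of pdftools, and the sketch assumes that a file whose every page is blank (or that fails to load) should be flagged for a closer look.

```r
# Hypothetical helper (not part of pdftools): return the paths or URLs
# for which no page yields any non-whitespace text.
find_empty_pdfs <- function(paths, text_fun = pdftools::pdf_text) {
  is_empty <- vapply(paths, function(p) {
    # Assumption: a file that errors on read is treated like an empty one
    pages <- tryCatch(text_fun(p), error = function(e) character(0))
    !any(nzchar(trimws(pages)))
  }, logical(1))

  unname(paths[is_empty])
}
```

Running find_empty_pdfs(urls) on the full vector of URLs would give the subset to inspect with pdf_info() or to pass to an OCR step.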
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pdftools_1.8

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1    Rcpp_0.12.18