ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

pdf_text returns empty strings for specific files V2 #42

Open MeyerGrace opened 6 years ago

MeyerGrace commented 6 years ago

Hi, I'm trying to read ~1700 PDFs from URLs. Most work, but ~150 do not. For example, this one: pdftools::pdf_text("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Years-BTP-Auction-Results-30.12.2002.pdf") gives an empty string.

This is reported as a bug here: https://github.com/ropensci/pdftools/issues/24 but installing the dev version didn't fix the issue for me.

However, this PDF's metadata differs noticeably from that of the files that read properly: it is not linearised, its metadata contains many "\n" characters, and its layout is one-column. See pdftools::pdf_info("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Years-BTP-Auction-Results-30.12.2002.pdf")

Is this a new bug, the same bug, or known behaviour because this PDF is bad?

Thanks for any help. This package has already saved so many hours of work!

Cheers, Grace

An example that reads as expected: pdftools::pdf_text("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Year-BTP--Auction-Results-27.02.2.pdf")

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] pdftools_1.8

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1 Rcpp_0.12.18
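
For anyone triaging a large batch like this, a minimal sketch of how the files that come back empty could be listed. The helper names `is_empty_extraction` and `find_empty_pdfs` are hypothetical, not part of pdftools; the only pdftools call assumed is `pdf_text()`, which returns one string per page.

```r
# Treat a result as empty when no page contains any non-whitespace text.
# `pages` is a character vector such as the one pdftools::pdf_text() returns.
is_empty_extraction <- function(pages) {
  length(pages) == 0 || all(!nzchar(trimws(pages)))
}

# Hypothetical wrapper: TRUE if a file yields no text, FALSE if it yields
# some, NA if reading it raised an error (e.g. a failed download).
# `read_fun` defaults to pdftools::pdf_text but can be swapped for testing.
find_empty_pdfs <- function(paths, read_fun = pdftools::pdf_text) {
  vapply(paths, function(p) {
    pages <- tryCatch(read_fun(p), error = function(e) NULL)
    if (is.null(pages)) NA else is_empty_extraction(pages)
  }, logical(1))
}
```

Calling `names(which(find_empty_pdfs(urls)))`, with `urls` being your own character vector of paths or URLs, would then give the subset worth inspecting by hand.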

MarcinKosinski commented 2 years ago

Having the same issue in 2022.

MarcinKosinski commented 2 years ago

My intuition is that those PDFs can't be read because of the printing mechanism that created them. In my case the text in my PDF is not selectable, and from this URL I learned that

(when printing) if the application doesn’t send “real” text, but instead an image of text, PDFCreator can’t convert this back into “real” text.

which means the content of my PDF is an image rather than text. I used pdf_ocr_text() instead of pdf_text() to read the text.
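
A rough sketch of that workaround, falling back to OCR only for pages where `pdf_text()` finds nothing. The function name `read_pdf_with_ocr_fallback` is hypothetical; it assumes that `pdf_ocr_text()` accepts a `pages` argument and returns one string per requested page, and that the tesseract package is installed.

```r
# Use the embedded text layer where present; OCR only the blank pages.
# `text_fun` and `ocr_fun` are injectable so the logic can be tried without
# pdftools/tesseract; the defaults assume both packages are installed.
read_pdf_with_ocr_fallback <- function(
    path,
    text_fun = pdftools::pdf_text,
    ocr_fun = function(path, pages) pdftools::pdf_ocr_text(path, pages = pages)) {
  pages <- text_fun(path)
  # Indices of pages with no non-whitespace text
  blank <- which(!nzchar(trimws(pages)))
  if (length(blank) > 0) {
    pages[blank] <- ocr_fun(path, blank)
  }
  pages
}
```

OCR is much slower than reading the text layer and can misread characters, so restricting it to the blank pages keeps the fast path for files that already work.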