Open sckott opened 8 years ago
and wonder if this is something that can be fixed in pdftools, or if the pdf itself is malformed
This is very difficult. The error messages are printed to stderr by the C library, so we never actually get them in R. When compiling libpoppler you can configure how it deals with errors, but that is usually beyond our control from the R interface.
Not sure why this file is giving these errors. I'll ask on the poppler mailing list.
If you find more problematic PDF files can you add them to this issue? That is very helpful for testing / debugging.
Yep, will do
OK I found a way to set a custom error callback: https://github.com/jeroenooms/pdftools/commit/0060f146675820a6edf9b107b8f5ef0ed1220840. So the parsing errors now show up in R as messages, which is much nicer.
Still don't know why your pdf is giving so many errors though.
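Since the parsing errors now arrive as R messages, they can be caught with ordinary condition handling. A minimal sketch, assuming the errors are signalled with `message()`: `fake_pdf_text()` is a hypothetical stand-in for `pdftools::pdf_text()` so the example runs without a PDF file, and `collect_messages()` is an illustrative helper, not part of pdftools.

```r
# Hypothetical stand-in for pdftools::pdf_text(): emits two parsing errors
# as messages (as described in this thread), then returns the page text.
fake_pdf_text <- function(path) {
  message("PDF error: Invalid shared object hint table offset")
  message("PDF error: Failed parsing page 1 using hint tables")
  "page one text"
}

# Illustrative helper (not part of pdftools): evaluate `expr`, keep every
# message as its own element instead of printing it, and return both the
# value and the collected messages.
collect_messages <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      # conditionMessage() keeps the trailing newline; trimws() drops it
      msgs <<- c(msgs, trimws(conditionMessage(m)))
      invokeRestart("muffleMessage")
    }
  )
  list(value = value, messages = msgs)
}

res <- collect_messages(fake_pdf_text("paper.pdf"))
```

With the real package, `collect_messages(pdftools::pdf_text("paper.pdf"))` should behave the same way, provided the errors really are raised via `message()`. If the messages are simply unwanted, base R's `suppressMessages()` is the shorter route.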
Awesome. Have been reviewing a paper, will get more egs soon...
another eg
download.file("https://github.com/sckott/scott/raw/gh-pages/pdfs/ChamberlainEtal2010Oecologia_journalcopy.pdf",
"paper.pdf")
pdftools::pdf_text('paper.pdf')
#> error: Invalid shared object hint table offset
#> error: Failed to get object num from hint tables for page 1
#> error: Failed parsing page 1 using hint tables
#> error: Failed to get object num from hint tables for page 1
#> ... cutoff
Are you getting the same errors with the pdftotext command line utility (which is included with the same poppler package from brew)?
pdftotext ChamberlainEtal2010Oecologia_journalcopy.pdf
Yes, same errors:
Syntax Warning: Invalid shared object hint table offset
Syntax Warning: Failed to get object num from hint tables for page 1
Syntax Warning: Failed parsing page 1 using hint tables
...
All of these give errors of various kinds: https://github.com/sckott/pdftoolspdfs - let me know if you want me to paste in the errors
But you still get the text correct, even though there were parsing errors on some of the elements? Things like watermarks seem to cause conversion errors, but all the main text should be there?
Sorry, yes, the text does come back fine
Probably the same issue (the result looks good in both cases)
library(pdftools) # (pdftools * 2.2 2019-03-10 [1] CRAN (R 3.5.3))
download.file("http://www.staedtestatistik.de/fileadmin/vdst/Dortmund2019/503FJT2019_RShiny.pdf", "paper.pdf", mode = "wb")
bitmap <- pdftools::pdf_render_page('paper.pdf')
#> PDF error: Invalid least number of objects reading page offset hints table
str(bitmap, 1)
#> 'bitmap' raw [1:4, 1:842, 1:595] ff ff ff ff ...
txt <- pdf_text("paper.pdf")
#> PDF error: Invalid least number of objects reading page offset hints table
str(txt, 1)
#> chr [1:14] " Nutzung von R R-Shiny-Apps Zeitreihenapp Tourenplanung Verteilungsvergleiche Fazit\r\nVDSt Fr"| __truncated__ ...
Created on 2019-06-06 by the reprex package (v0.3.0)
Errors are a little hard to parse; they are all combined into one string. Though maybe this is good enough?
an example