error reporting in pdf_text

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

https://docs.ropensci.org/pdftools

error reporting in pdf_text #3

Open sckott opened 8 years ago

sckott commented 8 years ago

Errors are a little hard to parse; they are all combined into one string. Though maybe this is good enough?

an example

# mode = "wb" keeps the binary PDF intact on Windows
download.file("https://github.com/sckott/scott/raw/gh-pages/pdfs/Chamberlain%26Rudgers2011EvolEcol.pdf", 
              "paper.pdf", mode = "wb")
pdftools::pdf_text('paper.pdf')

#> poppler/error: Invalid shared object hint table offsetpoppler/error: Failed to get object num from 
#> hint tables for page 1poppler/error: Failed parsing page 1 using hint tablespoppler/error: Failed to 
#> get object num from hint tables for page 1poppler/error: Failed parsing page 1 using 
#> hint tablespoppler/error: Failed to get object num from hint tables for page 1poppler/error: 
#> Failed parsing page 1 using hint tables
#> ... cutoff

sckott commented 8 years ago

I also wonder whether this is something that can be fixed in pdftools, or whether the PDF itself is malformed.

jeroen commented 8 years ago

This is very difficult. The error messages are printed to stderr by the C library, so we never actually get them in R. When compiling libpoppler you can configure how it deals with errors, but that is usually beyond our control from the R interface.

jeroen commented 8 years ago

Not sure why this file is giving these errors. I'll ask on the poppler mailing list.

jeroen commented 8 years ago

If you find more problematic PDF files, can you add them to this issue? That is very helpful for testing and debugging.

sckott commented 8 years ago

Yep, will do

jeroen commented 8 years ago

OK I found a way to set a custom error callback: https://github.com/jeroenooms/pdftools/commit/0060f146675820a6edf9b107b8f5ef0ed1220840. So the parsing errors now show up in R as messages, which is much nicer.

Still don't know why your pdf is giving so many errors though.
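
A minimal sketch of how such messages could be collected programmatically, assuming the parsing errors really are signalled as ordinary R message conditions, as described in this comment. `collect_messages()` and `fake_pdf_text()` are hypothetical names used only for illustration; the latter stands in for `pdftools::pdf_text()` so the snippet runs without a PDF or a poppler install.

```r
# Hypothetical helper (not part of pdftools): evaluate an expression,
# collect every message() condition it signals, and return both the
# value and the collected messages.
collect_messages <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      # conditionMessage() ends with a newline for message(); trim it
      msgs <<- c(msgs, trimws(conditionMessage(m)))
      # stop the message from also being printed to the console
      invokeRestart("muffleMessage")
    }
  )
  list(value = value, messages = msgs)
}

# Stand-in for pdftools::pdf_text() (assumption: the real function
# reports parser errors via message()).
fake_pdf_text <- function(path) {
  message("error: Invalid shared object hint table offset")
  message("error: Failed parsing page 1 using hint tables")
  "page one text"
}

res <- collect_messages(fake_pdf_text("paper.pdf"))
res$messages  # one element per parser message
res$value     # the extracted text is still returned
```

With the real package, the call would presumably be `collect_messages(pdftools::pdf_text("paper.pdf"))`, which keeps each error as a separate string instead of one combined block.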

sckott commented 8 years ago

Awesome. Have been reviewing a paper, will get more egs soon...

sckott commented 8 years ago

another eg

# mode = "wb" keeps the binary PDF intact on Windows
download.file("https://github.com/sckott/scott/raw/gh-pages/pdfs/ChamberlainEtal2010Oecologia_journalcopy.pdf", 
              "paper.pdf", mode = "wb")
pdftools::pdf_text('paper.pdf')
#> error: Invalid shared object hint table offset
#> error: Failed to get object num from hint tables for page 1
#> error: Failed parsing page 1 using hint tables
#> error: Failed to get object num from hint tables for page 1
#>  ...  cutoff

jeroen commented 8 years ago

Are you getting the same errors with the pdftotext command line utility (which is included with the same poppler package from brew)?

pdftotext ChamberlainEtal2010Oecologia_journalcopy.pdf

sckott commented 8 years ago

Yes, same errors:

Syntax Warning: Invalid shared object hint table offset
Syntax Warning: Failed to get object num from hint tables for page 1
Syntax Warning: Failed parsing page 1 using hint tables

...

sckott commented 8 years ago

All of these give errors of various kinds https://github.com/sckott/pdftoolspdfs - let me know if you want me to paste in the errors

jeroen commented 8 years ago

But you still get the correct text, even though there were parsing errors on some of the elements? Things like watermarks seem to cause conversion errors, but all the main text should be there?

sckott commented 8 years ago

Sorry, yes, the text does come back fine

patperu commented 5 years ago

Probably the same issue (the result looks good in both cases)

library(pdftools) # (pdftools    * 2.2     2019-03-10 [1] CRAN (R 3.5.3))
download.file("http://www.staedtestatistik.de/fileadmin/vdst/Dortmund2019/503FJT2019_RShiny.pdf", "paper.pdf", mode = "wb")

bitmap <- pdftools::pdf_render_page('paper.pdf')
#> PDF error: Invalid least number of objects reading page offset hints table
str(bitmap, 1)
#>  'bitmap' raw [1:4, 1:842, 1:595] ff ff ff ff ...

txt <- pdf_text("paper.pdf")
#> PDF error: Invalid least number of objects reading page offset hints table
str(txt, 1)
#>  chr [1:14] "  Nutzung von R        R-Shiny-Apps     Zeitreihenapp      Tourenplanung Verteilungsvergleiche Fazit\r\nVDSt Fr"| __truncated__ ...

Created on 2019-06-06 by the reprex package (v0.3.0)