ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

pdf_data(pdf, font_info = TRUE) throws error when page is fullscreen image #108

Closed: cutterkom closed this issue 2 years ago

cutterkom commented 2 years ago

When running pdf_data() with font_info = TRUE, it breaks when a page is a fullscreen image.

Error: Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE (same behaviour as described here: https://github.com/ropensci/pdftools/issues/88).

Probably a good idea to catch this case.

kimonkrenz commented 2 years ago

I am observing the same problem. Performing pdf_data() with font_info = TRUE on a PDF that includes either a full image or blank page throws the same error:

Warning: Column sizes are not equal in DataFrame::push_back, object degrading to List
Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE

Unfortunately, this is a common case for PDFs. Is there a straightforward workaround for this?
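
One possible stopgap on an affected version, sketched here under assumptions: wrap the call in tryCatch() and retry without font information if the font_info = TRUE call fails. The helper name safe_pdf_data and its reader argument are hypothetical and not part of pdftools; the fallback drops font information for the whole document, not just the offending page.

```r
# Hypothetical helper (not part of pdftools): try font_info = TRUE first,
# then fall back to font_info = FALSE if that call raises an error.
# `reader` defaults to pdftools::pdf_data and exists only so the function
# can be exercised with a stand-in reader.
safe_pdf_data <- function(pdf, reader = pdftools::pdf_data) {
  tryCatch(
    reader(pdf, font_info = TRUE),
    error = function(e) {
      warning(
        "font_info = TRUE failed (", conditionMessage(e),
        "); retrying without font info",
        call. = FALSE
      )
      reader(pdf, font_info = FALSE)
    }
  )
}
```

Calling safe_pdf_data("~/Desktop/adrianadantas.pdf") would then return the per-page data without the font columns whenever the error above is hit, along with a warning saying so.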

jeroen commented 2 years ago

Do you have an example PDF file and some code so that I can reproduce this?

kimonkrenz commented 2 years ago

Hi @jeroen, please find below a reproducible example using the following two pdf files:

adrianadantas.pdf adrianadantas_altered.pdf

library(pdftools)

# 1: load the PDF (including the blank page) without font_info
pdf_1 <- pdftools::pdf_data(pdf = "~/Desktop/adrianadantas.pdf")

# 2: load the PDF (including the blank page) with font_info
pdf_2 <- pdftools::pdf_data(pdf = "~/Desktop/adrianadantas.pdf", font_info = TRUE)

# 3: load the PDF with the blank page removed, with font_info
pdf_3 <- pdftools::pdf_data(pdf = "~/Desktop/adrianadantas_altered.pdf", font_info = TRUE)

View(pdf_3[[1]])

1 and 3 work as expected; 2 throws the following error:

Warning: Column sizes are not equal in DataFrame::push_back, object degrading to List
Warning: Column sizes are not equal in DataFrame::push_back, object degrading to List
Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
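
To narrow down which pages trigger the failure, a rough sketch is to inspect the result of the call that does work (without font_info) and list the pages that contain no text boxes. This assumes pdf_data() returns one data frame per page and that a blank or image-only page yields a data frame with zero rows; the helper name empty_pages is hypothetical.

```r
# Hypothetical helper: given the list of per-page data frames returned by
# pdf_data(), return the indices of pages that have no rows (no text boxes).
empty_pages <- function(pages) {
  which(vapply(pages, function(df) NROW(df) == 0L, logical(1)))
}

# Intended usage (assumes pdftools is installed and the file exists):
# empty_pages(pdftools::pdf_data("~/Desktop/adrianadantas.pdf"))
```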

jeroen commented 2 years ago

I think it is fixed. Can you try to install the new version:

install.packages("pdftools", repos = "https://ropensci.r-universe.dev")

kimonkrenz commented 2 years ago

Installed, tested and works perfectly. Many thanks, @jeroen!

jeroen commented 2 years ago

Thanks, I sent it to CRAN