ropensci / tesseract

Bindings to Tesseract OCR engine for R
https://docs.ropensci.org/tesseract

Tesseract example not working due to errors in tiff:writeTIFF #19

Closed lodderig closed 6 years ago

lodderig commented 6 years ago

I am unable to run the Tesseract example:

library(pdftools)
library(tiff)
library(tesseract)  # provides ocr(), which is called below

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

# Extract text from images
out <- ocr("page.tiff")
cat(out)

tiff::writeTIFF() triggers an error:

tiff::writeTIFF(bitmap, "page.tiff")
Error in tiff::writeTIFF(bitmap, "page.tiff") : 
  INTEGER() can only be applied to a 'integer', not a 'raw'

knapply commented 6 years ago

I just came across this myself.

The solution seems to be setting the numeric argument of pdf_render_page() to TRUE, so that the output is a plain numeric array (values between 0 and 1) instead of the default raw bitmap.

bitmap <- pdf_render_page(news, dpi = 300, 
                          numeric = TRUE) # the modification

tiff::writeTIFF(bitmap, "page.tiff")
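
For anyone wondering why the flag matters, here is a minimal base-R sketch of the conversion that numeric = TRUE appears to perform. It does not call pdftools or tiff. The channels x width x height layout of the raw bitmap and the helper name raw_to_numeric() are assumptions made for illustration, not something confirmed in this thread.

```r
# Hypothetical stand-in for a raw bitmap: 4 channels x 3 wide x 2 high.
# (Assumed layout; every pixel here is R = 0, G = 128, B = 255, A = 255.)
raw_bitmap <- array(rep(as.raw(c(0, 128, 255, 255)), times = 6),
                    dim = c(4, 3, 2))

# Hypothetical helper, not part of pdftools or tiff:
# scale raw bytes to [0, 1] and reverse the dimension order.
raw_to_numeric <- function(x) {
  stopifnot(is.raw(x), length(dim(x)) == 3)
  scaled <- array(as.numeric(x) / 255, dim = dim(x))
  aperm(scaled)  # channels x width x height -> height x width x channels
}

num_bitmap <- raw_to_numeric(raw_bitmap)

typeof(raw_bitmap)  # storage type before the conversion
typeof(num_bitmap)  # storage type after the conversion
dim(num_bitmap)     # dimensions after reversing the order
```

The result is an ordinary numeric array with values between 0 and 1, which is the kind of input tiff::writeTIFF() accepted in the example above.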

@jeroen Here's a starting point when you get a chance to investigate. The error that tiff::writeTIFF() throws is confusing, but a README and documentation fix is probably all it needs. I've been meaning to actually start contributing to packages where I can, but I don't know when I'll actually take the plunge. Cheers.

num_TRUE_bitmap <- pdf_render_page(news, dpi = 300, 
                                   numeric = TRUE)
tiff::writeTIFF(num_TRUE_bitmap, "page.tiff")

class(num_TRUE_bitmap)
# [1] "array"

num_FALSE_bitmap <- pdf_render_page(news, dpi = 300, 
                                    numeric = FALSE)

tiff::writeTIFF(num_FALSE_bitmap, "page.tiff")

# Error in tiff::writeTIFF(num_FALSE_bitmap, "page.tiff") : 
#   INTEGER() can only be applied to a 'integer', not a 'raw'

class(num_FALSE_bitmap)
# [1] "bitmap" "rgba" 

jeroen commented 6 years ago

Updated the README.