ropensci / tesseract

Bindings to Tesseract OCR engine for R
https://docs.ropensci.org/tesseract

Tesseract in R not recognizing “&” aka Ampersand #35

Closed harshpdave closed 5 years ago

harshpdave commented 6 years ago

I need to write code that reads text from images in R. I am using the tesseract and magick packages for this, and I am facing an issue where the OCR output converts an "&" to "8:". I have attached the image that I am using as input (testimage). Below is the code that I am running:

test2 <- image_read("C:/Users/admin/Desktop/testimage.jpg") %>%
  image_resize("2000") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()
cat(test2)
write.table(test2, "C:/Users/admin/Desktop/output2.txt", sep="\t")

I have also tried the modified version below, but the result is still the same:

wl = paste(paste(letters, LETTERS, collapse="", sep=""), "0123456789&;")
engine <- tesseract(options = list(tessedit_char_whitelist = wl), cache=FALSE)
test3 <- image_read("C:/Users/admin/Desktop/testimage.jpg") %>%
  image_resize("500") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()
engine <- tesseract(options = list(tessedit_char_whitelist = ";&"))
cat(test3)

Below is the output that I am getting:

No relation between boycotting panchayat polls 8: Article 35A: Subramanian Swamy

I have gone through this website and have also posted the same question on Stack Overflow, but several hours have passed and I have not received a solution.

Any help would be greatly appreciated.

FunnyCheese commented 6 years ago

Any solutions?

jeroen commented 6 years ago

Try with the new Tesseract 4. Run this in a clean R session (make sure tesseract is not loaded):

devtools::install_github("ropensci/tesseract")
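After the install, the OCR could be re-run along the lines of the sketch below. It is only an illustration, not a confirmed fix: the helper names `build_whitelist` and `ocr_with_whitelist` are hypothetical, and whether `tessedit_char_whitelist` takes effect may depend on the Tesseract version and engine mode. It also passes the engine explicitly to `tesseract::ocr()`, since the second snippet in the original post creates `engine` but never hands it to `image_ocr()`.

```r
# Hypothetical helper: letters, digits, and a few extra symbols as one string.
build_whitelist <- function(extra = "&;:") {
  paste0(paste(c(letters, LETTERS, 0:9), collapse = ""), extra)
}

# Hypothetical helper: preprocess the image as in the original post, then
# run OCR with an engine that is actually passed to the OCR call.
ocr_with_whitelist <- function(path, whitelist = build_whitelist()) {
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_char_whitelist = whitelist)
  )
  img <- magick::image_read(path)
  img <- magick::image_resize(img, "2000")
  img <- magick::image_convert(img, colorspace = "gray")
  img <- magick::image_trim(img)
  tesseract::ocr(img, engine = engine)
}

# Usage (path taken from the original post):
# cat(ocr_with_whitelist("C:/Users/admin/Desktop/testimage.jpg"))
```

Calling the packages through `tesseract::` and `magick::` avoids relying on which packages happen to be attached in the session, which fits the advice to start from a clean R session.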