The 72 dpi is a bug; it should be 300 as well. I'll change that.
The loop is there because magick images are vectorized, so each object may contain many images.
We write it to disk because that is what tesseract needs. It uses the file extension to determine the format. The performance overhead of writing to disk is negligible.
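Roughly, the convenience path looks something like this (a simplified sketch, not the package's exact code; ocr_magick_sketch is just an illustrative name):

```r
library(magick)
library(tesseract)

ocr_magick_sketch <- function(img) {
  vapply(seq_along(img), function(i) {
    tmp <- tempfile(fileext = ".png")   # the extension tells tesseract the format
    on.exit(unlink(tmp), add = TRUE)
    # write this frame of the (vectorized) magick object to disk at 300 dpi
    image_write(img[i], path = tmp, format = "png", density = "300x300")
    ocr(tmp)                            # tesseract then reads the file from disk
  }, character(1))
}
```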
cool thanks. No more questions.
I don't think setting density downsamples the image. It only sets the dpi property on the image, which determines its physical size; tesseract needs that to guess the text size.
So if I understand it correctly: if your scan is 600x600 pixels and you set the density to 300, then tesseract assumes your photo is 2 by 2 inches. If you set it to 72 dpi, then tesseract assumes the photo is 8.3 x 8.3 inches.
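A quick sanity check of that arithmetic in R:

```r
# pixels / dpi = implied physical size in inches (the pixel data itself is untouched)
600 / 300   # 2 inches per side at 300 dpi
600 / 72    # ~8.33 inches per side at 72 dpi
```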
Whatever the case, the error rate was higher at 72 dpi. I also believe the user should be able to decide the density. You could probably check magick::image_info and take the dpi from there before writing it to disk.
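Something along these lines (just a sketch; ocr_keep_density is a made-up name, and I'm assuming the density string returned by image_info can be fed straight back to image_write):

```r
library(magick)
library(tesseract)

ocr_keep_density <- function(img) {
  dpi <- image_info(img)$density[1]                 # e.g. "300x300" or "72x72"
  if (is.na(dpi) || !nzchar(dpi)) dpi <- "300x300"  # fall back when no dpi is set
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp), add = TRUE)
  image_write(img[1], path = tmp, format = "png", density = dpi)
  ocr(tmp)
}
```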
Hmmm, I think that's what I used to do, but I found we got better results if we force it to 300.
Are you aware that you don't have to use this part of the code? You can just pass the path to your png file directly to ocr() and ocr_data(), and it won't use magick at all. So you can create the png image as you prefer and then feed that to tesseract.
The part of the code we are talking about is only for convenience when users want to pipe magick objects directly to tesseract.
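For example (my_scan.png is just a placeholder for a file you prepared yourself):

```r
library(tesseract)

# pass the path directly; magick is not involved at all
text  <- ocr("my_scan.png")
words <- ocr_data("my_scan.png")
```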
I do make scans at 600 dpi and pre-process them in magick. I just never saved the processed images before feeding them to ocr (which I now understand I should have done).
I expect you'll get similar results. Again, density won't downsample; it just gives tesseract a hint about the size of the text to look for.
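So a workflow like this should work (a rough sketch; the file name and the pre-processing steps are just placeholders):

```r
library(magick)
library(tesseract)

img <- image_read("scan_600dpi.png")          # placeholder file name
img <- image_convert(img, type = "Grayscale") # example pre-processing
img <- image_trim(img)

# save the processed image yourself, keeping the dpi you scanned at
processed <- tempfile(fileext = ".png")
image_write(img, path = processed, format = "png", density = "600x600")
text <- ocr(processed)
```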
I noticed reduced quality of recognition when I used ocr_data() instead of ocr(HOCR=TRUE), and sure enough, you are saving the temp file at a lower (72 dpi) resolution in ocr_data(). Is there any reason for that other than speed? It actually does not matter what resolution I use when scanning my image; it will be downsampled to 300 dpi in the temporary file. One way of circumventing this is, of course, saving my pre-processed image as a file before passing the path to ocr().

On the topic of tmp files, I recently ran into an issue with allocating new temp file names on Windows (bug report). I see that you are vapply-saving each magick "sub-image" (channel/layer?) to a temp file (using magick::image_write) and immediately reading it back. What is the reason for doing that (since it is inside the same loop)? Could we get away with magick::image_convert/image_resize piped to OCR? I suspect it has something to do with speed, so you want to enforce a lower density and a bitmap format. But you have virtually no control over density or format when the image is read directly from a file, so why bother? I am of the strong opinion that if the image comes in as a magick class, it should be the user's responsibility to flatten and downsample it to facilitate acceptable OCR speed.

On the subject of processing magick images with lapply/vapply, maybe we could make an exception for "one-page flat magick images" (length(img)==1, where inherits(img, "magick-image")) and treat them as raw? Is it multi-page tiffs that you intend to catch with that vapply?
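Something like this, just to illustrate the idea (ocr_flexible is a made-up name, not a proposal for the actual implementation):

```r
library(magick)
library(tesseract)

ocr_flexible <- function(img, ...) {
  if (inherits(img, "magick-image") && length(img) == 1) {
    # one-page flat magick image: write it once, keeping whatever dpi
    # the user already set on the object
    tmp <- tempfile(fileext = ".png")
    on.exit(unlink(tmp), add = TRUE)
    image_write(img, path = tmp, format = "png")
    return(ocr(tmp, ...))
  }
  # multi-frame objects (e.g. multi-page tiffs): one temp file per frame
  vapply(seq_along(img), function(i) {
    tmp <- tempfile(fileext = ".png")
    on.exit(unlink(tmp), add = TRUE)
    image_write(img[i], path = tmp, format = "png")
    ocr(tmp, ...)
  }, character(1))
}
```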