The 72 dpi is a bug; it should be 300 as well. I'll change that.
The loop is there because magick images are vectorized, so each object may contain many images.
We write it to disk because that is what tesseract needs. It uses the file extension to determine the format. The performance overhead of writing to disk is negligible.
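Roughly, the convenience path looks something like this (a simplified sketch, not the package's exact code; ocr_magick_sketch is just an illustrative name):

```r
library(magick)
library(tesseract)

ocr_magick_sketch <- function(img) {
  vapply(seq_along(img), function(i) {
    tmp <- tempfile(fileext = ".png")   # the extension tells tesseract the format
    on.exit(unlink(tmp), add = TRUE)
    # write this frame of the (vectorized) magick object to disk at 300 dpi
    image_write(img[i], path = tmp, format = "png", density = "300x300")
    ocr(tmp)                            # tesseract then reads the file from disk
  }, character(1))
}
```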
cool thanks. No more questions.
I don't think setting density downsamples the image. It only sets the dpi property on the image, which determines its physical size; tesseract needs that to guess the text size.
So if I understand it correctly: if your scan is 600x600 pixels and you set the density to 300, then tesseract assumes your photo is 2 by 2 inches. If you set it to 72 dpi, then tesseract assumes the photo is 8.3 x 8.3 inches.
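A quick sanity check of that arithmetic in R:

```r
# pixels / dpi = implied physical size in inches (the pixel data itself is untouched)
600 / 300   # 2 inches per side at 300 dpi
600 / 72    # ~8.33 inches per side at 72 dpi
```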
Whatever the case, the error rate was higher at 72 dpi. I also believe the user should be able to decide the density. You could probably check magick::image_info and take the dpi from there before writing it to disk.
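Something along these lines (just a sketch; ocr_keep_density is a made-up name, and I'm assuming the density string returned by image_info can be fed straight back to image_write):

```r
library(magick)
library(tesseract)

ocr_keep_density <- function(img) {
  dpi <- image_info(img)$density[1]                 # e.g. "300x300" or "72x72"
  if (is.na(dpi) || !nzchar(dpi)) dpi <- "300x300"  # fall back when no dpi is set
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp), add = TRUE)
  image_write(img[1], path = tmp, format = "png", density = dpi)
  ocr(tmp)
}
```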
Hmmm, I think that's what I used to do, but I found we got better results if we force it to 300.
Are you aware that you don't have to use this part of the code? You can just pass the path to your png file directly to ocr() and ocr_data(), and it won't use magick at all. So you can create the png image as you prefer and then feed that to tesseract.
The part of the code we are talking about is only for convenience when users want to pipe magick objects directly to tesseract.
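For example (my_scan.png is just a placeholder for a file you prepared yourself):

```r
library(tesseract)

# pass the path directly; magick is not involved at all
text  <- ocr("my_scan.png")
words <- ocr_data("my_scan.png")
```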
I do make scans at 600 dpi and pre-process them in magick. I just never saved the processed images before feeding them to ocr (which I now understand I should have done).
I expect you'll get similar results. Again, density won't downsample; it just gives tesseract a hint about the size of the text to look for.
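So a workflow like this should work (a rough sketch; the file name and the pre-processing steps are just placeholders):

```r
library(magick)
library(tesseract)

img <- image_read("scan_600dpi.png")          # placeholder file name
img <- image_convert(img, type = "Grayscale") # example pre-processing
img <- image_trim(img)

# save the processed image yourself, keeping the dpi you scanned at
processed <- tempfile(fileext = ".png")
image_write(img, path = processed, format = "png", density = "600x600")
text <- ocr(processed)
```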
I noticed reduced quality of recognition when I used ocr_data() instead of ocr(HOCR=TRUE), and sure enough, you are saving the temp file at a lower (72 dpi) resolution in ocr_data(). Is there any reason for that other than speed? It actually does not matter what resolution I use when scanning my image; it will be downsampled to 300 dpi in the temporary file. One way of circumventing this is, of course, saving my pre-processed image as a file before passing the path to ocr().

On the topic of tmp files, I recently ran into an issue with allocating new temp file names on Windows (bug report). I see that you are vapply-saving each magick "sub-image" (channel/layer?) to a temp file (using magick::image_write) and immediately reading it back. What is the reason for doing that (since it is inside the same loop)? Could we get away with magick::image_convert/image_resize piped to OCR? I suspect it has something to do with speed, so you want to enforce a lower density and a bitmap format. But you have virtually no control over density or format when the image is read directly from a file, so why bother? I am of the strong opinion that if the image comes in as a magick class, it should be the user's responsibility to flatten and downsample it to facilitate acceptable OCR speed.

On the subject of processing magick images with lapply/vapply, maybe we could make an exception for "one-page flat magick images" (length(img)==1, where inherits(img, "magick-image")) and treat them as raw? Is it multi-page tiffs that you intend to catch with that vapply?
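Something like this, just to illustrate the idea (ocr_flexible is a made-up name, not a proposal for the actual implementation):

```r
library(magick)
library(tesseract)

ocr_flexible <- function(img, ...) {
  if (inherits(img, "magick-image") && length(img) == 1) {
    # one-page flat magick image: write it once, keeping whatever dpi
    # the user already set on the object
    tmp <- tempfile(fileext = ".png")
    on.exit(unlink(tmp), add = TRUE)
    image_write(img, path = tmp, format = "png")
    return(ocr(tmp, ...))
  }
  # multi-frame objects (e.g. multi-page tiffs): one temp file per frame
  vapply(seq_along(img), function(i) {
    tmp <- tempfile(fileext = ".png")
    on.exit(unlink(tmp), add = TRUE)
    image_write(img[i], path = tmp, format = "png")
    ocr(tmp, ...)
  }, character(1))
}
```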