poor performance compared to raw tesseract

imalone commented 5 months ago

I've been trying out gimagereader recently and was struggling with it. I thought the problem was tesseract's OCR, but running tesseract directly produces much better results. Here's the start of a sample scanned from a newspaper article, no options, just "tesseract 20240610_094500.jpg 20240610_094500-1":

======
| News

Dalya Alberge

It is a founding document of the . US and inspired the Declaration ~ of Independence and the purge of

English power from the colonies. ‘But, ironically, George Mason’s
======
[...continues...]

And the start of the same sample scanned in gimagereader (with the automatic page segmentation option for tesseract, recognise all, no layout detection or image adjustments):

======
Fi Spe 3 ai ; R Nadia! Os a pt EAS ar Mi eben ied

ERE a7 CIARA TIGA — Dats diay

Narayan,

Snes 5) 4 i 70 ACN ay aaa

LEN cise 7 i

Dalya! Se Loreey or — in | Washington, |

ie i rE 2a clearer ,al
======

This is on Fedora 41 (beta), gimagereader-gtk-3.4.2-1.fc40.x86_64

I can see it links tesseract:
$ ldd /usr/bin/gimagereader-gtk|grep tesseract
libtesseract.so.5.3.4 => /lib64/libtesseract.so.5.3.4 (0x00007f002bc00000)

And this is the same as my command line tesseract:
$ rpm -qf /lib64/libtesseract.so.5.3.4
tesseract-5.3.4-4.fc40.x86_64
$ rpm -qf /bin/tesseract
tesseract-5.3.4-4.fc40.x86_64

The file is a jpeg picture taken on a phone. I've tried loading it in Gimp, allowing conversion of the embedded colour profile, and exporting as jpeg, tiff (lzw) and png. This changes the outputs slightly for both direct tesseract and gimagereader (png and tiff are identical), but the picture remains the same: tesseract extracts a reasonable scan, while gimagereader produces mainly nonsense with a few patches of coherence.
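To put a rough number on "changes the outputs slightly" versus "mainly nonsense", the two text outputs can be compared word by word. The following is only a sketch using Python's standard `difflib`; the function name `similarity` and the sample strings (taken from the excerpts above) are illustrative, not something either tool provides.

```python
import difflib


def similarity(text_a, text_b):
    """Return a 0.0-1.0 word-level similarity ratio between two OCR outputs."""
    words_a = text_a.split()
    words_b = text_b.split()
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matching
    # words and T is the total number of words in both sequences.
    return difflib.SequenceMatcher(None, words_a, words_b).ratio()


if __name__ == "__main__":
    # Short excerpts from the two outputs quoted in this report.
    cli_text = "It is a founding document of the . US"
    gui_text = "Fi Spe 3 ai ; R Nadia! Os a pt EAS"
    print(f"similarity: {similarity(cli_text, gui_text):.2f}")
```

A ratio close to 1.0 between the jpeg and png runs of the same tool, and a much lower ratio between tesseract and gimagereader, would match what is described here.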

It would be nice to be able to use gimagereader, since the layout detection would be handy (I've tried layout detection and removing any spurious selections and it outputs similar nonsensical output). Any ideas what might be going wrong here?
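One thing that could be ruled out from the command line is the page segmentation mode. A plain `tesseract` run uses mode 3 (fully automatic, no OSD), so running the same image under a few other `--psm` values would show whether any of them reproduces the garbled output. This is a hedged sketch, not a diagnosis: the choice of modes, the `-psmN` output naming and the helper names `build_command` and `run_all` are assumptions made for illustration.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Modes to try (an arbitrary selection): 3 = fully automatic (the default),
# 4 = single column, 6 = single uniform block, 11 = sparse text.
PSM_MODES = (3, 4, 6, 11)


def build_command(image, psm, lang="eng"):
    """Build the tesseract argv for one run; output goes to <stem>-psm<N>.txt."""
    image = Path(image)
    out_base = image.with_name(f"{image.stem}-psm{psm}")
    return ["tesseract", str(image), str(out_base), "-l", lang, "--psm", str(psm)]


def run_all(image, lang="eng"):
    """Run tesseract once per mode in PSM_MODES."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    for psm in PSM_MODES:
        subprocess.run(build_command(image, psm, lang), check=True)


if __name__ == "__main__":
    # Usage: python psm_compare.py 20240610_094500.jpg
    if len(sys.argv) > 1:
        run_all(sys.argv[1])
```

If none of the modes produces output as poor as gimagereader's, the difference more likely lies in how the image is handed to the library than in the segmentation setting.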

imalone commented 5 months ago

[Attached image: 20240610_094500]

MagnusPGBerg commented 2 months ago

I just compared gImageReader and OCRFeeder, both from the Debian/Devuan unstable repository and both using the same Tesseract. The language is Swedish. gImageReader produces garbage, while OCRFeeder gives excellent to acceptable results, depending on the quality of the original. I can't understand why the results aren't the same in both applications. I tested all possible settings in gImageReader without getting better results.

manisandro / gImageReader

poor performance compared to raw tesseract #675