Open imalone opened 5 months ago
I just compared gImageReader and OCRFeeder, both from the Debian/Devuan unstable repository and both using the same Tesseract. The language is Swedish. gImageReader produces garbage, while OCRFeeder gives results ranging from excellent to acceptable, depending on the quality of the original. I can't understand why the results aren't the same in both applications. I tested all possible settings in gImageReader without getting better results.
I've been trying out gimagereader recently and was struggling with it. I thought the problem was tesseract's OCR, but running tesseract directly produces much better results. Here's the start of a sample scanned from a newspaper article, with no options, just "tesseract 20240610_094500.jpg 20240610_094500-1":
\======
| News
Dalya Alberge
It is a founding document of the . US and inspired the Declaration ~ of Independence and the purge of
English power from the colonies. ‘But, ironically, George Mason’s
\======
[...continues...]
And the start of the same sample scanned in gimagereader (with the automatic page segmentation option for tesseract, recognise all, no layout detection or image adjustments):
\======
Fi Spe 3 ai ; R Nadia! Os a pt EAS ar Mi eben ied
ERE a7 CIARA TIGA — Dats diay
Narayan,
Snes 5) 4 i 70 ACN ay aaa
LEN cise 7 i
Dalya! Se Loreey or — in | Washington, |
ie i rE 2a clearer ,al
\======
This is on Fedora 41 (beta), gimagereader-gtk-3.4.2-1.fc40.x86_64
I can see it links tesseract:
$ ldd /usr/bin/gimagereader-gtk|grep tesseract
libtesseract.so.5.3.4 => /lib64/libtesseract.so.5.3.4 (0x00007f002bc00000)
And this is the same as my command line tesseract:
$ rpm -qf /lib64/libtesseract.so.5.3.4
tesseract-5.3.4-4.fc40.x86_64
$ rpm -qf /bin/tesseract
tesseract-5.3.4-4.fc40.x86_64
The file is a jpeg picture taken on a phone. I've tried loading it in Gimp, allowing conversion of the embedded colour profile, and exporting as jpeg, tiff (lzw) and png. This changes the outputs slightly for both direct tesseract and gimagereader (png and tiff are identical), but the overall picture remains the same: tesseract extracts a reasonable scan while gimagereader produces mainly nonsense with a few patches of coherence.
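One way to narrow this down from the command line, offered only as a sketch: run tesseract on the same file under several page segmentation modes and check whether any of them reproduces the garbled gimagereader output. The function names (out_base, psm_sweep), the list of modes and the output naming are illustrative assumptions, not something from the runs above; --psm and -l are standard tesseract options.

```shell
#!/usr/bin/env bash
# Sketch: run the tesseract CLI on one image with several page segmentation
# modes (PSM), writing one text file per mode for side-by-side comparison.
# The mode list and file naming are illustrative choices, not from the report.

# Build the output base name for an image and a PSM value.
# tesseract itself appends ".txt" to this base.
out_base() {
    local img="$1" psm="$2"
    local stem="${img##*/}"   # drop any leading directory
    stem="${stem%.*}"         # drop the file extension
    printf '%s-psm%s' "$stem" "$psm"
}

# Run tesseract once per PSM value; the language defaults to "eng".
psm_sweep() {
    local img="$1" lang="${2:-eng}" psm
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH" >&2
        return 127
    fi
    # 3 = fully automatic (the CLI default), 4 = single column,
    # 6 = single uniform block, 11 = sparse text
    for psm in 3 4 6 11; do
        tesseract "$img" "$(out_base "$img" "$psm")" --psm "$psm" -l "$lang"
    done
}

# Only run when an image path is given on the command line.
if [ "$#" -ge 1 ]; then
    psm_sweep "$@"
fi
```

Saved under a hypothetical name such as psm-sweep.sh, it would be run as "bash psm-sweep.sh 20240610_094500.jpg eng", leaving one .txt file per mode in the current directory. If one mode produces output resembling the gimagereader result, that would point at segmentation settings rather than the image itself.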
It would be nice to be able to use gimagereader, since the layout detection would be handy (I've tried layout detection and removing any spurious selections, and it produces similarly nonsensical output). Any ideas what might be going wrong here?