Inconsistent results - Githubissues

KarellenX commented 1 year ago

Hello,

I have been using your software for less than a year. I use it occasionally to OCR text from books.

The problem I encounter is extremely inconsistent results. They range from "wow, this is pretty good" all the way to "where did this nonsensical garbage come from?" It also misses a lot of the text.

I guess I am doing something wrong, but I have no idea what! My scans seem pretty good and clear. I can drop the same image into a 2009 version of Readiris Pro 11 and get remarkably better results.

Is there a beginner's guide on how to improve the results? Should scans be at a certain resolution (DPI)?

Any feedback will be appreciated.

[Attached image: ocr1]

KarellenX commented 1 year ago

Is this project not supported anymore?

Feli07notoldman commented 1 year ago

Hello KarellenX, I am answering here in my own language; Google can translate it easily. I have only been working with gImageReader for a few days, but I can say that the results for German have pleasantly surprised me.

Regarding your problem: I have come to the conclusion that it could be due to the quality of the scan and to "eng.traineddata".

Your problem made me curious, so I tried the following:

I edited the screenshot (JPG) a little, using only the central part of the page that is relevant for OCR. With the tool "ScanTailor Advanced" (which you can find via Google) I converted the JPG into a 300 DPI TIFF image. This also gave the image higher contrast. Note: I scan my books at 300 or, better, 600 DPI and save the files as PNG (no loss of quality). Then I process the scans with ScanTailor Advanced to split the double pages into single pages.
The result with gImageReader was considerably better than the result you described.
I was still not satisfied, so I replaced the file eng.traineddata with eng.traineddata (BEST). The file can be found at https://github.com/tesseract-ocr/tessdata_best I have a portable installation of the program, and on my machine the folder is C:\Users\Internet\AppData\Local\Programs\Tesseract-OCR\tessdata
The OCR result was very good indeed. A few small errors still remained.
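
For anyone who wants to script the model swap described above, here is a minimal Python sketch. It assumes you have already downloaded eng.traineddata from the tessdata_best repository by hand; the function name `install_traineddata` is made up for illustration, and the tessdata folder differs from one installation to the next, so pass in your own paths. The existing model is kept as a `.bak` file so it can be restored.

```python
import shutil
from pathlib import Path


def install_traineddata(downloaded: Path, tessdata_dir: Path) -> Path:
    """Copy a downloaded .traineddata file into tessdata_dir.

    If a file with the same name already exists there, it is first
    renamed to '<name>.bak' so the original model can be restored.
    (Hypothetical helper, not part of gImageReader or Tesseract.)
    """
    target = tessdata_dir / downloaded.name
    if target.exists():
        backup = target.with_name(target.name + ".bak")
        # Path.replace overwrites an older backup if one is present.
        target.replace(backup)
    shutil.copy2(downloaded, target)
    return target
```

To go back to the previous model, delete the new file and rename the `.bak` copy to its original name.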

As you can see, the result can be improved considerably with a better scan.

Kind regards

Feli07notoldman commented 1 year ago

Here is also an excerpt from the scan that I used for the result above. Kind regards

Addendum: It may also be that the scanned book uses a typeface (a font) that Tesseract cannot recognize well. That means OCR gives better results for some books than for others.

adrienbeau commented 1 year ago

I do all my text recognition at 300 dpi, which works fine for me.
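
If you are not sure what resolution a saved scan actually has, a PNG file can record it in an optional pHYs chunk, stored in pixels per metre (300 DPI is about 11811 pixels per metre). The sketch below reads that chunk using only the Python standard library. The function name `png_dpi` is hypothetical, and not every scanner program writes the chunk, in which case the function returns None.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if the
    file does not record a resolution in pixels per metre."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs" and len(body) == 9:
            x_ppm, y_ppm, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per metre, 0 = aspect ratio only
                return None
            return round(x_ppm * 0.0254), round(y_ppm * 0.0254)
        if ctype in (b"IDAT", b"IEND"):
            break  # pHYs, if present, comes before the image data
        pos += 12 + length
    return None
```

A typical call would be `png_dpi(open("page.png", "rb").read())`, where page.png stands for one of your own scans.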

I also used to have mixed results, but things are better now. I am now very careful about having perfectly horizontal text. I find the slightest rotation can decrease text recognition dramatically.

I use the automatic rotation button of gImageReader, but I also zoom and then draw a selection rectangle over a wide area of text. I check that the bottom or top of the rectangle is precisely aligned with the baseline of the characters over the whole width of the page. I correct the rotation if this is not the case.
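
To put a number on the manual baseline check described above: if you note the pixel position of the same baseline near the left and right edges of the page, the skew is just the angle of the line through those two points. This is a small Python sketch of that arithmetic, not something gImageReader does internally, and both function names are made up for illustration. Even 1 degree of skew moves a baseline by about 35 pixels across a 2000 pixel wide page.

```python
import math


def skew_angle_degrees(left, right):
    """Angle of the line through two (x, y) pixel positions on one baseline.

    Image coordinates have y growing downwards, so a positive result
    means the line of text drops towards the right.
    """
    (x1, y1), (x2, y2) = left, right
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def drift_pixels(angle_degrees, width_px):
    """Vertical drift of a baseline across width_px at the given skew."""
    return width_px * math.tan(math.radians(angle_degrees))
```

The image then needs to be rotated by the same amount in the opposite direction; which sign that corresponds to depends on the tool's rotation convention, so check it on one page first.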

Feli07notoldman commented 1 year ago

The program ScanTailor Advanced, which I mentioned in my first comment above, does this rotation and much more automatically for me, e.g. rotation and deskewing, and splitting a scan into two separate images, one for each book page. https://github.com/4lex4/scantailor-advanced It is my post-processor for scans and can also be used as a preprocessor for any OCR program, to prepare the scans for better OCR results. You can use it as a batch processor for a bunch of scans. I have been using it for some years.

KarellenX commented 1 year ago

Thank you for your advice, @Feli07notoldman and @adrienbeau. I will try out the methods you outlined and hopefully get consistently better results.

Out of curiosity, is this software still being maintained? I don't see a lot of activity here.

Regards

adrienbeau commented 1 year ago

Yes it is. It is a small, mature project, so don't expect a lot of activity.

KarellenX commented 1 year ago

Ok, thank you :)

KarellenX commented 1 year ago

Hello all, I just wanted to finish off and close this report with my experience after all the advice here and also at the mobilereads.com forum.

In case anybody else is a bit clueless like me, the following is what I did and the software I used. The results have improved dramatically, and I have no complaints.

I needed better scanner software for my WIA-compliant scanner. I installed the following software. It is simple and easy to use. The BEST feature is that it can batch scan. Enter how many scans to make and how many seconds between scans (6 seconds in my case), then press Start. All you need to worry about is turning pages within those 6 seconds... https://www.naps2.com/
And if you are after software for very fast and easy screen captures... https://github.com/greenshot/greenshot
Scan Tailor Advanced for fixing scanned images. Installed from this repo... https://github.com/4lex4/scantailor-advanced
Then gImageReader for OCR.

Thanks again for pointing me in the right direction :)

manisandro / gImageReader

Inconsistent results #636
