scribeocr / scribeocr

Web interface for recognizing text, proofreading OCR, and creating fully-digitized documents.
https://scribeocr.com
GNU Affero General Public License v3.0

Difference between scribeocr and vanilla Balearica version of Tesseract. #40

Open rmast opened 3 months ago

rmast commented 3 months ago

When I choose recognize with Dutch on this image:

https://user-images.githubusercontent.com/3341558/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg

the final sentence is correct, but a sentence starting with éé is completely missed.

image

Looking at the code, é is explicitly blacklisted by the Scribe build of Tesseract.

image

Choosing the Build Vanilla version of Tesseract gives this result:

image

één is correctly found, but the final sentence shows a Tesseract-bug that I wanted to investigate further: https://github.com/tesseract-ocr/tesseract/issues/3906

It is unclear to me which version of the original Tesseract the naptha version of Tesseract is based on, so I don't know which version to focus on to solve it.

It is also unclear to me which naptha version is used for the 'Vanilla Tesseract'.

I also saw that you are using ScrollView to do the necessary debugging on Tesseract that the original Tesseract repo owners do not dare to do. Can you write a short guide on how to debug Tesseract with this JavaScript version of ScrollView?

Balearica commented 3 months ago

The configuration setting removing é should be removed, good catch. This is not a difference between "builds" but rather is a configuration setting (using Tesseract CLI it would be -c tessedit_char_blacklist=|éï) that is always applied regardless of build. This was set when Scribe OCR had no "vanilla" option and only supported English. It makes no sense in the context of non-English languages.
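
To illustrate why a blacklist chosen for English causes trouble in other languages, here is a minimal sketch that lists which words contain a blacklisted character and therefore cannot be produced as written. This is not Scribe OCR's actual code: the helper name `wordsHitByBlacklist` and the sample words are hypothetical, and only the blacklist string `|éï` comes from the setting described above.

```javascript
// Hypothetical helper (not part of Scribe OCR or Tesseract.js):
// returns the words that contain at least one blacklisted character.
function wordsHitByBlacklist(words, blacklist) {
  // Normalize to NFC so a precomposed "é" and "e" + combining accent compare equal.
  const banned = new Set(Array.from(blacklist.normalize('NFC')));
  return words.filter((word) =>
    Array.from(word.normalize('NFC')).some((ch) => banned.has(ch))
  );
}

const blacklist = '|éï'; // value taken from the setting described above
const dutchSample = ['één', 'geïnd', 'zin', 'laatste']; // illustrative words only
console.log(wordsHitByBlacklist(dutchSample, blacklist));
```

In this sample, the words containing é or ï are reported, while plain ASCII words are unaffected.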

I will write some documentation on how to use the ScrollView visualization features.

For organizational purposes, please try and keep every distinct bug report or feature request in exactly one issue. Including the discussion, this is the 3rd thread opened in the last week for discussing the differences between Tesseract builds.

rmast commented 3 months ago

Concerning Tesseract builds and duplicate bookkeeping, I guess you were also involved in the version 5 upgrade of the naptha repo? Are you aware of any instructions for porting new versions of the original Tesseract, so we could eventually back-port solutions to unsolved bugs?


Balearica commented 3 months ago

I removed the é config option in the master branch, so that particular issue should be resolved.

Yes, I do maintain Tesseract.js, and do periodically update it to whatever the latest version of Tesseract is.

Balearica commented 3 months ago

The following documentation page describes how to enable the debugging visualizations from the ScrollView application.

https://docs.scribeocr.com/scrollview_debug.html

rmast commented 1 month ago

Anyway, I was able to rebuild the Vanilla version of Tesseract by rolling back the non-vanilla changes in glue.js and glue.cpp, and altering build-scripts/var.sh to do the vanilla build.

I patched the vanilla-build with my patch for the word print. It now shows!

Screenshot from 2024-08-08 21-01-50

rmast commented 1 month ago

For some reason the change you made on the master branch almost two months ago doesn't show on scribeocr.com. I'll try to build it myself.

rmast commented 1 month ago

I can't get the blocked é to work with master. I'll apply my patch to make sure I have mastered the whole pipeline, including clearing node_modules and the browser cache.

Balearica commented 1 month ago

@rmast The version of Tesseract.js used by the web interface comes from the tess directory in this repo, which is not automatically updated. Any change to the files in the Tesseract.js-core repo that is not manually copy/pasted into this directory will not show up, which is presumably why you're observing a difference.

This is not ideal, but it allows the site to be served almost instantly by CloudFlare (or users) without a build process or downloading additional dependencies. The reason why Tesseract.js is in node_modules is that the Scribe OCR Node.js interface uses that version.
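
Because the copy is manual, it can help to check whether the vendored files still match a fresh build. The following is a minimal sketch, assuming both locations are flat directories containing files with the same names. The function name `findMismatchedFiles` and the paths in the example call are hypothetical and need adjusting to the actual checkout.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// SHA-256 of a file's contents, as a hex string.
function fileHash(filePath) {
  return crypto.createHash('sha256').update(fs.readFileSync(filePath)).digest('hex');
}

// Hypothetical helper: for every regular file directly inside vendoredDir,
// look for a file with the same name in builtDir. Returns the sorted names
// of files that are missing from builtDir or whose contents differ.
function findMismatchedFiles(vendoredDir, builtDir) {
  const mismatched = [];
  for (const entry of fs.readdirSync(vendoredDir, { withFileTypes: true })) {
    if (!entry.isFile()) continue; // subdirectories are ignored in this sketch
    const vendored = path.join(vendoredDir, entry.name);
    const built = path.join(builtDir, entry.name);
    if (!fs.existsSync(built) || fileHash(vendored) !== fileHash(built)) {
      mismatched.push(entry.name);
    }
  }
  return mismatched.sort();
}

// Example call with assumed paths:
// console.log(findMismatchedFiles('tess', '../tesseract.js-core'));
```

An empty result means every vendored file has a byte-identical counterpart in the build directory. Any name listed was either not rebuilt or not copied over.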

rmast commented 1 month ago

"één" is still not coming through as of commit d015ef9afff63900d59b08291891fcbdc10f8c91, but my grasp of the build-and-show process, with copying files from other repos, is clearly sufficient, as I have the "print" in the top right fixed, which needs a browser cache refresh.

Screenshot from 2024-08-09 09-53-43

Balearica commented 1 month ago

> "één" still not coming through from commit https://github.com/scribeocr/scribeocr/commit/d015ef9afff63900d59b08291891fcbdc10f8c91, but my hold of the build-and-show-process with copying stuff from other repo's is clearly enough as I have the "print" on the right top fixed, which needs a browser cache refresh.

I misunderstood your earlier comment--given the title of the issue I thought you were referring to a commit made to Tesseract. Regarding the issue of één not being recognized, it looks like 2 of 3 instances of één in this image are now correctly identified following é being removed from the character blacklist. In the final instance, Tesseract is presumably incorrectly identifying these letters as images or noise.

rmast commented 1 month ago

I will just bluntly copy the vanilla versions over the scribeocr version to see whether it really is within Tesseract instead of in its config. Line 75 of the LSTM and those other Tesseract-inputs only contain 'n' in the Chrome debugger.

rmast commented 1 month ago

And indeed, the scribeocr-version of Tesseract also contains a bias towards één. I'll try to build and debug that standalone.

rmast commented 1 month ago

git bisect has revealed the first commit biased towards "één" at the start of a sentence: https://github.com/Balearica/tesseract/commit/c646b3643719391aae924a53e7325c20268e4b9c

In that commit the cause is:

finder = new ColumnFinder(static_cast<int>(to_block->line_size / 2), blkbox.botleft(),

However, in the final version that division by 2 is no longer present. It is replaced in https://github.com/Balearica/tesseract/commit/bbdaa8ca0791ae672bba797995f4142cb306633d by some other divisions.
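
For readers unfamiliar with how git bisect narrows down a regression, here is a minimal sketch of the underlying binary search. It assumes an ordered list of commits where the first is known good and the last is known bad. The commit names and the `isBad` test are made up for illustration and do not correspond to the real Tesseract history.

```javascript
// commits: ordered oldest to newest.
// isBad(commit): false for every commit before the regression, true from the
// regression onward. Assumes commits[0] is good and the last commit is bad.
function firstBadCommit(commits, isBad) {
  let good = 0;                 // index of a commit known to be good
  let bad = commits.length - 1; // index of a commit known to be bad
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    if (isBad(commits[mid])) {
      bad = mid;  // regression is at mid or earlier
    } else {
      good = mid; // regression is after mid
    }
  }
  return commits[bad];
}

// Hypothetical history: the regression enters at 'c4'.
const history = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'];
const missesWord = (commit) => history.indexOf(commit) >= 3;
console.log(firstBadCommit(history, missesWord)); // prints "c4"
```

Each step halves the remaining range, so a history of a few hundred commits needs fewer than ten builds and tests to find the first bad one.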