Difference between scribeocr and vanilla Balearica version of Tesseract.

rmast commented 3 months ago

When I run recognition with Dutch selected on this image:

https://user-images.githubusercontent.com/3341558/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg

the final sentence is correct, but a sentence starting with "één" is missed completely.

Looking at the code, the character é is explicitly blacklisted in the Scribe build of Tesseract.

Choosing the Vanilla build of Tesseract gives this result:

"één" is recognized correctly, but the final sentence shows a Tesseract bug that I wanted to investigate further: https://github.com/tesseract-ocr/tesseract/issues/3906

It is unclear to me which version of upstream Tesseract the naptha version of Tesseract is based on, so I don't know which version to focus on to fix it.

It is also unclear to me which naptha version is used for the 'Vanilla Tesseract' build.

I also saw that you are using ScrollView to do the debugging of Tesseract that the upstream Tesseract repo owners do not dare to do. Could you write short instructions on how to debug Tesseract with this JavaScript version of ScrollView?

Balearica commented 3 months ago

The configuration setting removing é should be removed, good catch. This is not a difference between "builds" but rather is a configuration setting (using Tesseract CLI it would be -c tessedit_char_blacklist=|éï) that is always applied regardless of build. This was set when Scribe OCR had no "vanilla" option and only supported English. It makes no sense in the context of non-English languages.
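To illustrate the setting described above, here is a minimal sketch of a language-aware variant, written against the Tesseract CLI rather than Scribe OCR's own code. The helper name `build_tesseract_args` and the rule "apply the blacklist only for English" are assumptions made for illustration; only the `-l` and `-c tessedit_char_blacklist=...` options and the `|éï` value come from the CLI and from this comment.

```python
import shlex

# Characters that, per this thread, were blacklisted when only English was supported.
ENGLISH_ONLY_BLACKLIST = "|éï"


def build_tesseract_args(image, output_base, lang, blacklist=ENGLISH_ONLY_BLACKLIST):
    """Build a Tesseract CLI argument list (hypothetical helper).

    The blacklist is added only when the language is exactly "eng", so
    characters such as é stay available for Dutch ("nld") and for
    multi-language values like "eng+nld".
    """
    args = ["tesseract", image, output_base, "-l", lang]
    if lang == "eng" and blacklist:
        args += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return args


if __name__ == "__main__":
    # Print the command line that would be run for each language.
    for lang in ("eng", "nld"):
        print(shlex.join(build_tesseract_args("page.jpg", "out", lang)))
```

With `lang="nld"` no blacklist is passed, so a word like "één" is not suppressed by configuration. Dropping the setting entirely, regardless of language, is the simpler alternative.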

I will write some documentation on how to use the ScrollView visualization features.

For organizational purposes, please try to keep each distinct bug report or feature request in exactly one issue. Including the discussion, this is the third thread opened in the last week to discuss the differences between Tesseract builds.

rmast commented 3 months ago

Concerning Tesseract builds and duplicated maintenance: I guess you were also involved in the version 5 upgrade of the naptha repo? Are you aware of any instructions for porting new versions of upstream Tesseract, so that we could eventually back-port fixes for unsolved bugs?

(Sent by email in reply to Balearica's comment of Sunday, June 16, 2024, 9:45:31 AM on scribeocr/scribeocr issue #40.)


Balearica commented 3 months ago

I removed the é config option in the master branch, so that particular issue should be resolved.

Yes, I do maintain Tesseract.js, and do periodically update it to whatever the latest version of Tesseract is.

Balearica commented 3 months ago

The following documentation page describes how to enable the debugging visualizations from the ScrollView application.

https://docs.scribeocr.com/scrollview_debug.html