Open rmast opened 3 months ago
The configuration setting removing é should be removed, good catch. This is not a difference between "builds" but rather is a configuration setting (using Tesseract CLI it would be -c tessedit_char_blacklist=|éï) that is always applied regardless of build. This was set when Scribe OCR had no "vanilla" option and only supported English. It makes no sense in the context of non-English languages.
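For readers who want to reproduce the effect of this setting outside Scribe OCR, here is a minimal sketch that builds the equivalent Tesseract CLI invocation with and without the blacklist. The helper name build_tesseract_cmd and the file name page.png are illustrative assumptions, not part of Scribe OCR or Tesseract; only the -l and -c options and the tessedit_char_blacklist variable come from the Tesseract CLI itself.

```python
from typing import List, Optional


def build_tesseract_cmd(image_path: str, lang: str = "nld",
                        char_blacklist: Optional[str] = None) -> List[str]:
    """Build an argument list for the Tesseract CLI (hypothetical helper).

    Returning a list, rather than a shell string, means characters such
    as '|' need no shell quoting when passed to subprocess.run().
    """
    # "stdout" as the output base makes Tesseract print the recognized text.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if char_blacklist:
        # -c sets one config variable for this run only.
        cmd += ["-c", f"tessedit_char_blacklist={char_blacklist}"]
    return cmd


# English-only setup: '|', 'é' and 'ï' are never emitted.
# "page.png" is a placeholder file name.
english_cmd = build_tesseract_cmd("page.png", lang="eng", char_blacklist="|éï")

# Dutch run without the blacklist, so words such as "één" remain possible.
dutch_cmd = build_tesseract_cmd("page.png", lang="nld")
```

Running both commands on the same image, for example with subprocess.run(cmd, capture_output=True, text=True), would be one way to check whether a missing é comes from the blacklist or from recognition itself.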
I will write some documentation on how to use the ScrollView visualization features.
For organizational purposes, please try and keep every distinct bug report or feature request in exactly one issue. Including the discussion, this is the 3rd thread opened in the last week for discussing the differences between Tesseract builds.
Concerning Tesseract builds and duplicated bookkeeping, I guess you were also involved in the version 5 upgrade of the naptha repo? Are you aware of any instructions for porting new versions of the original Tesseract, so we could eventually back-port fixes for unsolved bugs?
I removed the é config option in the master branch, so that particular issue should be resolved.
Yes, I do maintain Tesseract.js, and do periodically update it to whatever the latest version of Tesseract is.
The following documentation page describes how to enable the debugging visualizations from the ScrollView application.
Anyway, I was able to rebuild the vanilla version of Tesseract by rolling back the non-vanilla changes in glue.js and glue.cpp and altering build-scripts/var.sh to do the vanilla build.
I patched the vanilla-build with my patch for the word print. It now shows!
For some reason the change you did on the master branch almost two months ago doesn't show on scribeocr.com. I'll try to build it myself.
I still can't get the previously blocked é recognized with master. I'll apply my patch to make sure I have the whole pipeline under control, including clearing node_modules and the browser cache.
@rmast The version of Tesseract.js used by the web interface comes from the tess directory in this repo, which is not automatically updated. Any change to the files in the Tesseract.js-core repo that is not manually copied into this directory will not show up, which is presumably why you're observing a difference.
This is not ideal, but it allows the site to be served almost instantly by CloudFlare (or users) without a build process or downloading additional dependencies. The reason Tesseract.js is in node_modules is that the Scribe OCR Node.js interface uses that version.
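Because the vendored copy has to be refreshed by hand, a quick comparison of file hashes can show whether it has drifted from a fresh build. The following is only a sketch: the function names are hypothetical, and the two directories passed in (a local Tesseract.js-core build output and the tess directory) are assumptions about your checkout layout rather than anything Scribe OCR ships.

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_files(source_dir: Path, vendored_dir: Path) -> List[str]:
    """List files in source_dir whose vendored copy is missing or different.

    Hypothetical helper: only the top level of source_dir is compared,
    and files are matched by name.
    """
    stale = []
    for src in sorted(source_dir.iterdir()):
        if not src.is_file():
            continue
        copy = vendored_dir / src.name
        if not copy.is_file() or file_digest(copy) != file_digest(src):
            stale.append(src.name)
    return stale
```

An empty result means every built file has an identical vendored copy; any name in the list still needs to be copied over before the change can show up in the web interface.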
"één" still not coming through from commit commit d015ef9afff63900d59b08291891fcbdc10f8c91, but my hold of the build-and-show-process with copying stuff from other repo's is clearly enough as I have the "print" on the right top fixed, which needs a browser cache refresh.
"één" still not coming through from commit commit https://github.com/scribeocr/scribeocr/commit/d015ef9afff63900d59b08291891fcbdc10f8c91, but my hold of the build-and-show-process with copying stuff from other repo's is clearly enough as I have the "print" on the right top fixed, which needs a browser cache refresh.
I misunderstood your earlier comment--given the title of the issue I thought you were referring to a commit made to Tesseract. Regarding the issue of één not being recognized, it looks like 2 of 3 instances of één in this image are now correctly identified following é being removed from the character blacklist. In the final instance, Tesseract is presumably incorrectly identifying these letters as images or noise.
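A count like "2 of 3 instances" can be checked mechanically against the recognized text. The sketch below is one way to do that, assuming the OCR output is available as a plain string; the helper name count_word and the sample text are made up for illustration. It normalizes Unicode first, because é can arrive either as one code point or as "e" plus a combining accent.

```python
import unicodedata


def count_word(text: str, word: str) -> int:
    """Count whole-word matches, ignoring case and Unicode composition."""
    def norm(s: str) -> str:
        # NFC merges "e" + combining acute accent into a single "é".
        return unicodedata.normalize("NFC", s).casefold()

    target = norm(word)
    tokens = norm(text).split()
    # Strip common punctuation so "één," still matches "één".
    return sum(1 for t in tokens if t.strip(".,;:!?\"'()") == target)


# Made-up sample: the second "één" uses combining accents.
sample = "één appel, e\u0301e\u0301n peer en een banaan."
found = count_word(sample, "één")
```

Comparing this count between the Scribe build and the vanilla build on the same image gives a simple regression check for the blacklist change.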
I will just bluntly copy the vanilla versions over the scribeocr versions to see whether the cause really is within Tesseract rather than in its config. Line 75 of the LSTM and those other Tesseract inputs only contain 'n' in the Chrome debugger.
And indeed, the scribeocr-version of Tesseract also contains a bias towards één. I'll try to build and debug that standalone.
git bisect has revealed the first commit biased towards "één" at the start of a sentence:
https://github.com/Balearica/tesseract/commit/c646b3643719391aae924a53e7325c20268e4b9c
In that commit the cause is finder = new ColumnFinder(static_cast
however, in the final version that division by 2 is not present anymore. It is replaced in https://github.com/Balearica/tesseract/commit/bbdaa8ca0791ae672bba797995f4142cb306633d by some other divisions
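For readers unfamiliar with git bisect, the search it performs can be illustrated with a small binary search over a linear history. This is a simplified sketch of the idea, not git's implementation (git also handles branching history); the commit labels and the is_bad test are invented for the example.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Invented history: the regression is introduced in "c5".
history = [f"c{i}" for i in range(8)]
culprit = first_bad_commit(history, lambda c: int(c[1:]) >= 5)
```

In practice is_bad would mean building that commit and checking whether a sentence starting with "één" is still recognized, which is what each git bisect good or git bisect bad step records.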
When I choose recognize with Dutch on this image:
https://user-images.githubusercontent.com/3341558/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg
the final sentence is correct, but a sentence starting with éé is completely missed.
Looking at the code, the é is explicitly blocked by the Scribe build of Tesseract.
Choosing the Build Vanilla version of Tesseract gives this result:
één is correctly found, but the final sentence shows a Tesseract-bug that I wanted to investigate further: https://github.com/tesseract-ocr/tesseract/issues/3906
It is unclear to me which version of the original Tesseract the naptha version of Tesseract is based on, so I don't know which version to focus on to solve it.
It is also unclear to me which naptha version is used for the 'Vanilla Tesseract'.
Now I saw that you are also using the ScrollView to do the necessary debugging on Tesseract that the original Tesseract repo owners do not dare to do. Can you write a little instruction on how to debug tesseract with this javascript-version of ScrollView?