naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0
34.91k stars 2.21k forks

Character level recognition gives the same results as the word level recognition. #877

Closed (Kishlay-notabot closed this 8 months ago)

Kishlay-notabot commented 8 months ago

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
Latest release version 5.0.4
Describe the bug
Running Tesseract.js in two different page segmentation modes (PSMs) gives the same output.
Is Tesseract configured to give word-level output only?
Am I right in guessing that PSMs only refine the recognition scope and do not affect the output, which will always be in words?
Running in SINGLE_CHAR and PSM_SINGLE_WORD gives the same output for the same sample.
I want to sort the result character by character. To do that, I need to extract the bbox data of each detected character and use it further. Is this possible?


Balearica commented 8 months ago

Page segmentation mode (PSM) has no impact on the format or level of granularity of the output. Running with PSM SINGLE_WORD tells Tesseract "I believe the input image contains a single word," and running with SINGLE_CHAR tells Tesseract "I believe the input image contains a single character."

If you want more granular output with character-level bounding boxes, look at the blocks output format.
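As a rough illustration of reading character-level boxes from that output, here is a minimal sketch. It assumes the v5 result shape, where `data.blocks` nests as blocks → paragraphs → lines → words → symbols and each symbol carries `text`, `confidence`, and a `bbox` of `{ x0, y0, x1, y1 }`. The helper names `collectSymbols` and `sortLeftToRight` are made up for this example and are not part of Tesseract.js.

```javascript
// Hypothetical helpers, not part of the Tesseract.js API.
// Assumed input shape (Tesseract.js v5 `data.blocks`):
//   blocks[] -> paragraphs[] -> lines[] -> words[] -> symbols[]
// where each symbol looks like { text, confidence, bbox: { x0, y0, x1, y1 } }.

// Flatten the nested blocks structure into one array of characters.
function collectSymbols(blocks) {
  const symbols = [];
  for (const block of blocks ?? []) {
    for (const paragraph of block.paragraphs ?? []) {
      for (const line of paragraph.lines ?? []) {
        for (const word of line.words ?? []) {
          for (const symbol of word.symbols ?? []) {
            symbols.push({
              text: symbol.text,
              confidence: symbol.confidence,
              bbox: symbol.bbox,
            });
          }
        }
      }
    }
  }
  return symbols;
}

// Return a sorted copy: by left edge first, then by top edge to break ties.
// Suitable for a single line of text; multi-line input would need
// grouping by line before sorting.
function sortLeftToRight(symbols) {
  return [...symbols].sort(
    (a, b) => a.bbox.x0 - b.bbox.x0 || a.bbox.y0 - b.bbox.y0
  );
}
```

With a worker this would be used roughly as `const { data } = await worker.recognize(image);` followed by `sortLeftToRight(collectSymbols(data.blocks))`. If `data.blocks` comes back empty in your version, the blocks output may need to be requested explicitly through the output argument of `recognize`; check the API docs for the release you are running.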

Kishlay-notabot commented 8 months ago

Thank you for the insight. I will close this after experimenting.

o7