tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tesseract 4.1 incorrect symbol bounding rectangle coordinates #2636

Open romanchetto opened 5 years ago

romanchetto commented 5 years ago

Hello everyone.

After upgrading from Tesseract 4.0 to 4.1 I have run into the following issue: sometimes the symbols within a word are swapped. The text value and bounding rectangle returned by the word-level result iterator are correct. But when I collect the symbols of a problem word from the symbol-level iterator, the X-coordinate and width are sometimes incorrect:

[screenshot: tess4 1issue]

In this screenshot you can see that the symbol "m" comes before "A" in the word "American", and its width is twice the average symbol width in the word.
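A symptom like the overly wide "m" can be detected programmatically. The sketch below is not Tesseract code: it assumes the symbol boxes have already been collected from the symbol iterator into hypothetical `(text, x, width)` tuples, and it flags symbols whose width exceeds a chosen multiple of the word's median symbol width. The `factor` threshold and the coordinates in the usage note are illustrative assumptions, not values from the screenshot.

```python
from statistics import median

def wide_symbols(symbols, factor=1.8):
    """Return the symbols whose width exceeds factor * median width.

    symbols: list of (text, x, width) tuples (hypothetical layout),
             one per symbol of a single word.
    factor:  assumed threshold; tune it for your fonts and images.
    """
    if not symbols:
        return []
    med = median(width for _, _, width in symbols)
    return [s for s in symbols if s[2] > factor * med]
```

For example, with made-up boxes `[("A", 10, 12), ("m", 5, 24), ("e", 34, 11), ("r", 45, 10)]`, only the `"m"` entry is reported.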

The Tesseract 4.1 release notes say: "Fix for bounding box problem." Maybe this fix is somehow related to the issue.


Environment

Current Behavior:

Symbol bounding rectangle values are incorrect: when the symbols are ordered by X-coordinate, some of them are swapped.

Expected Behavior:

All symbol bounding rectangle values are correct. When ordered by X-coordinate, the symbols of a word appear in the same order as in the word's text value.
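The expected behavior can be written down as a simple check. This is a minimal sketch that assumes a left-to-right script and symbol boxes already collected, in iterator order, into hypothetical `(text, x, width)` tuples; it does not call the Tesseract API.

```python
def out_of_order(symbols):
    """Return the indices of symbols that start left of their predecessor.

    symbols: list of (text, x, width) tuples (hypothetical layout) in the
             order the symbol iterator returned them, i.e. text order.
    An empty result means the X order matches the text order.
    """
    return [
        i
        for i in range(1, len(symbols))
        if symbols[i][1] < symbols[i - 1][1]
    ]
```

With the made-up boxes `[("A", 10, 12), ("m", 5, 24), ("e", 34, 11)]` the function reports index 1, the "m" that starts to the left of "A".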

stweil commented 5 years ago

@noahmetzger, could you please test that with the latest code?

noahmetzger commented 5 years ago

With the new algorithm it's definitely better, but still far from perfect.

Version 4.1: [image: americanBoxes]

Version 5: [image: americanBoxesNew]

romanchetto commented 5 years ago

Hello @noahmetzger, @stweil. Is version 5 the current "master", and 4.1 the July release? By the way, could you please also try version 4.0 to compare it with 4.1 and 5? With version 4.0 I don't see this issue.

noahmetzger commented 5 years ago

If we are talking about commits 5280bbcade4e2dec5eef439a6e189504c2eadcd9 (4.1) and c69859cacb040a518cd64206ab1a2d6e48d17854 (4.0), the bounding boxes are completely identical for me.

romanchetto commented 5 years ago

@noahmetzger, sorry, by 4.0 I meant Release 4.0.0 from 29 October 2018, commit 5131699.

noahmetzger commented 5 years ago

@romanchetto you're right, 4.0 had better bounding boxes than 4.1.

Compared to 5.0 it's hard to say which one is better. But look for yourself. Here is the outcome of 4.0: [image: americanBoxes4 0Release]

stweil commented 5 years ago

I tried to bisect this. It looks like the regression was introduced by commit ce88adbf326a40b08de32e35eafffd29ef43290e.

romanchetto commented 5 years ago

@stweil, @noahmetzger, thank you for your help. I will think about how to handle this on my side, or just wait for the 5.0 release.

shermanrxie commented 4 years ago

We have the same issue, and the results of 4.1 are worse than those of 4.0. We plan to roll back to version 4.0 and hope it can be resolved in 5.0.

zdenop commented 4 years ago

@shermanrxie: your comment without a test image is useless, and hoping that somebody will fix it in 5.0 without a test case is pointless.