tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Duplicate Characters in Output Stream #2738

Open woodjohndavid opened 4 years ago

woodjohndavid commented 4 years ago

Please refer to the following link:

https://github.com/tesseract-ocr/tesseract/pull/2635

This concerns changes made to lstm_choices_mode.

Unless I misunderstand what these options are supposed to do, there appears to be a bug or oversight. Please refer to this user area thread:

https://groups.google.com/forum/#!topic/tesseract-ocr/5tC6appoUgE

There seems to be no way to prevent the LSTM engine from including duplicates in the generated text and/or HOCR output. The thread above shows a clear example of this.

Surely there must be some way to force Tesseract to include only the highest confidence level choice of character when there are multiple possibilities.

Also, apologies if this is posted in the wrong place, and apologies for possible duplicate postings. I am a Tesseract newbie so trying to learn the ropes.

Thanks.

bertsky commented 4 years ago

IMO it's perfectly legitimate to raise this issue again here. It has already surfaced several times under different names and descriptions, e.g. #1465. The usual recommendation is to improve the model quality. And this does of course help in reducing the likelihood of this happening. But nevertheless the underlying flaw (and you could also call it a bug) in the basic CTC implementation is still there. And it is more likely to surface when decoding less probable output segments (as happens with lstm_choice_iterations BTW).

I have (tentatively) termed the phenomenon of fake CTC duplicates diplopia, and recommended using Equal Spacing CTC or similar as a mitigation.

woodjohndavid commented 4 years ago

Thanks for the response Bertsky. Hopefully someone will take a look at trying to fix this issue. In the meanwhile, what I have done is, using the character level HOCR output, implemented a scan of that output to identify characters whose box dimensions overlap 'significantly' and then select only the highest confidence level character from those duplicates.
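A minimal sketch of the workaround described above, assuming the character boxes and confidences have already been extracted from the character-level HOCR output. The names `CharBox`, `horizontal_overlap_ratio`, `drop_overlapping_duplicates` and the 0.8 threshold are hypothetical; the thread does not say how 'significantly' was defined, so the overlap measure here (overlap width divided by the narrower box's width) is only one possible choice.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """One recognized character with its box and confidence (hypothetical type)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float


def horizontal_overlap_ratio(a: CharBox, b: CharBox) -> float:
    """Overlap width divided by the narrower box's width (0.0 if disjoint)."""
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if overlap <= 0 or narrower <= 0:
        return 0.0
    return overlap / narrower


def drop_overlapping_duplicates(chars: List[CharBox],
                                min_overlap: float = 0.8) -> List[CharBox]:
    """Keep only the highest-confidence character among neighbours whose
    boxes overlap by at least min_overlap (an assumed threshold)."""
    kept: List[CharBox] = []
    for ch in chars:
        if kept and horizontal_overlap_ratio(kept[-1], ch) >= min_overlap:
            # Same image position as the previous character: keep the better one.
            if ch.conf > kept[-1].conf:
                kept[-1] = ch
        else:
            kept.append(ch)
    return kept


# Made-up example: 'O' and '0' reported for nearly the same box.
word = [
    CharBox("O", 10, 5, 30, 40, 62.0),
    CharBox("0", 11, 5, 31, 40, 88.5),
    CharBox("K", 33, 5, 52, 40, 97.0),
]
print("".join(c.text for c in drop_overlapping_duplicates(word)))
```

This only works if the reported boxes are trustworthy, and the threshold would need tuning per language or script.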

Another small question: could you please tell me where to post issues (not just questions) about Tesseract? Is the Google tesseract-dev group active? My posting there received no response. Is this Github Issues section the right place?

bertsky commented 4 years ago

> In the meanwhile, what I have done is, using the character level HOCR output, implemented a scan of that output to identify characters whose box dimensions overlap 'significantly' and then select only the highest confidence level character from those duplicates.

That's a very good workaround, and it would also work inside the beam decoder. It's only a question of finding the best parameter set (maximum confidence, minimum overlap absolute/relative) for different languages/scripts objectively (i.e. on large corpora)... But then again, if we had such a test system, we could quickly evaluate the impact of equal spacing CTC as well.

> Another small question: could you please tell me where to post issues (not just questions) about Tesseract?

You are already in the right place for (possible) bugs and feature requests. As for mailing groups, I'm not qualified to answer that.

woodjohndavid commented 4 years ago

Hello again:

Well, it turns out that my workaround is not a good solution after all, as the character level box dimensions are not accurate in some cases. So this really needs to be promoted to being a bug of some kind, at least in so far as how the character level box dimensions are determined.

Attached is definitive proof of one case, although I have encountered many of them. This concerns the word "Cell" in the following sample image run through Tesseract. Attached are the following related files:

Sample Boxes Original.png - original image fed into Tesseract
Sample Boxes HOCR.hocr - full HOCR output from Tesseract for that image
Sample Box 1.png - screen shot taken of paint.net looking at box dimensions for letter 'C'
Sample Box 2.png - screen shot taken of paint.net looking at box dimensions for letter 'e'

Following is the snippet from the HOCR specific for the word "Cell" which is on its own near the center of the original image.

  <span class='ocrx_word' id='word_1_37' title='bbox 1094 604 1153 655; x_wconf 95'>
   <span class='ocrx_cinfo' title='x_bboxes 1094 611 1117 640; x_conf 99.545456'>C</span>
   <span class='ocrx_cinfo' title='x_bboxes 1107 604 1124 655; x_conf 99.56794'>e</span>
   <span class='ocrx_cinfo' title='x_bboxes 1118 617 1137 640; x_conf 99.500481'>l</span>
   <span class='ocrx_cinfo' title='x_bboxes 1139 608 1153 640; x_conf 99.421089'>l</span>
  </span>

If you examine this case, you will see that the box dimensions for the letters 'C' and 'e' overlap significantly, hence resulting in my attempted workaround for removing duplicates to remove the letter 'C' from my output. However, if you actually look at the boxes on the source image (see my paint.net screen shots) you will see that the box for the letter 'e' simply makes no sense and cannot possibly be what Tesseract used to extract the letter 'e' with a confidence level of 99.56.
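To put numbers on the overlap, here is a small sketch that parses the four `ocrx_cinfo` spans quoted above and measures the horizontal overlap between neighbouring characters. The regular expression is an assumption that matches only the exact attribute layout of this snippet (single quotes, `x_bboxes` followed by `x_conf`), and the helper names are made up for illustration; a real HOCR file would be better handled with an HTML parser.

```python
import re

# The character-level spans for the word "Cell", copied from the snippet above.
HOCR_SNIPPET = """
   <span class='ocrx_cinfo' title='x_bboxes 1094 611 1117 640; x_conf 99.545456'>C</span>
   <span class='ocrx_cinfo' title='x_bboxes 1107 604 1124 655; x_conf 99.56794'>e</span>
   <span class='ocrx_cinfo' title='x_bboxes 1118 617 1137 640; x_conf 99.500481'>l</span>
   <span class='ocrx_cinfo' title='x_bboxes 1139 608 1153 640; x_conf 99.421089'>l</span>
"""

# Assumed pattern: tailored to the layout shown in this thread only.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)


def parse_cinfo(hocr):
    """Return (text, x0, y0, x1, y1, conf) tuples in document order."""
    return [
        (text, int(x0), int(y0), int(x1), int(y1), float(conf))
        for x0, y0, x1, y1, conf, text in CINFO.findall(hocr)
    ]


def adjacent_overlaps(chars):
    """For each neighbouring pair: (left, right, overlap_px, narrower_width_px)."""
    rows = []
    for left, right in zip(chars, chars[1:]):
        overlap = max(0, min(left[3], right[3]) - max(left[1], right[1]))
        narrower = min(left[3] - left[1], right[3] - right[1])
        rows.append((left[0], right[0], overlap, narrower))
    return rows


for row in adjacent_overlaps(parse_cinfo(HOCR_SNIPPET)):
    print(row)
```

For this word the 'C' and 'e' boxes share 10 px, which is more than half of the 17 px wide 'e' box, and the 'e' box (y from 604 to 655) is as tall as the whole word's bbox.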

I have encountered many such examples, a lot of them where the box dimensions used to correctly select a particular character cover an area which includes the previous or next character as well.

Attachments: Sample Box 1, Sample Box 2, Sample Boxes Original, Sample Boxes HOCR.txt

bertsky commented 4 years ago

Thanks @woodjohndavid for providing details. I can confirm this with the current master. Here are all the boxes of that word:

[Images: boxes-2738 box, boxes-2738 dbg]

That's clearly a bug.

Looking at the debug log with -c classify_debug_level=1, I see...

Processing word with lang eng at:Bounding box=(1094,530)->(1153,562)
Trying word using lang eng, oem 2
<null>=110 On [0, 2), scores= 100(i=83=0,00107) 99,9(C=1=0,0548), Mean=99,9364, max=99,9939
C=1 On [2, 6), scores= 92,9(<null>=110=7,04) 99,9(<null>=110=0,102) 0,401(<null>=110=99,6) 1,47e-05(<null>=110=99,8), Mean=48,3003, max=99,8814
e=90 On [6, 9), scores= 92,3(<null>=110=7,64) 99,9(<null>=110=0,0713) 12,9(<null>=110=87,1), Mean=68,3886, max=99,9144
l=87 On [9, 13), scores= 1,02(<null>=110=99) 97,7(<null>=110=2,32) 98(<null>=110=1,92) 2,64(<null>=110=97,4), Mean=49,8415, max=98,0281
l=87 On [13, 16), scores= 30,8(<null>=110=69,2) 99,9(|=59=0,0643) 1,06(<null>=110=98,9), Mean=43,8997, max=99,8603

...(from LSTMRecognizer::LabelsFromOutputs / DebugActivationPath), and its underlying pixel-wise sequence...

0 null_char score=-0,191388, c=-0,191388, perm=2, hash=0
1 null_char score=-0,385364, c=-0,193976, perm=2, hash=0 prev:null_char score=-0,191388, c=-0,191388, perm=2, hash=0
2 label=1, uid=3=C [43 ]A score=-0,577528, c=-0,192164, perm=2, hash=1 prev:null_char score=-0,385364, c=-0,193976, perm=2, hash=0
3 label=1, uid=3=C [43 ]A score=-0,771448, c=-0,19392, perm=2, hash=1 prev:label=1, uid=3=C [43 ]A score=-0,577528, c=-0,192164, perm=2, hash=1
4 label=1, uid=3=C [43 ]A score=-0,96271, c=-0,191262, perm=2, hash=1 prev:label=1, uid=3=C [43 ]A score=-0,771448, c=-0,19392, perm=2, hash=1
5 null_char score=-1,15898, c=-0,196274, perm=2, hash=1 prev:label=1, uid=3=C [43 ]A score=-0,96271, c=-0,191262, perm=2, hash=1
6 label=90, uid=92=e [65 ]a score=-1,3505, c=-0,191512, perm=2, hash=c9 prev:null_char score=-1,15898, c=-0,196274, perm=2, hash=1
7 label=90, uid=92=e [65 ]a score=-1,54367, c=-0,193177, perm=2, hash=c9 prev:label=90, uid=92=e [65 ]a score=-1,3505, c=-0,191512, perm=2, hash=c9
8 label=90, uid=92=e [65 ]a score=-1,73536, c=-0,191687, perm=2, hash=c9 prev:label=90, uid=92=e [65 ]a score=-1,54367, c=-0,193177, perm=2, hash=c9
9 label=87, uid=89=l [6c ]a score=-1,92683, c=-0,191467, perm=2, hash=577e prev:label=90, uid=92=e [65 ]a score=-1,73536, c=-0,191687, perm=2, hash=c9
10 label=87, uid=89=l [6c ]a score=-2,17104, c=-0,244217, perm=2, hash=577e prev:label=87, uid=89=l [6c ]a score=-1,92683, c=-0,191467, perm=2, hash=577e
11 label=87, uid=89=l [6c ]a score=-2,36344, c=-0,192399, perm=2, hash=577e prev:label=87, uid=89=l [6c ]a score=-2,17104, c=-0,244217, perm=2, hash=577e
12 null_char score=-2,61503, c=-0,251586, perm=2, hash=577e prev:label=87, uid=89=l [6c ]a score=-2,36344, c=-0,192399, perm=2, hash=577e
13 label=87, uid=89=l [6c ]a score=-2,80721, c=-0,192181, perm=2, hash=25eff9 prev:null_char score=-2,61503, c=-0,251586, perm=2, hash=577e
14 label=87, uid=89=l [6c ]a score=-3,0016, c=-0,194395, perm=2, hash=25eff9 prev:label=87, uid=89=l [6c ]a score=-2,80721, c=-0,192181, perm=2, hash=25eff9
15 label=87, uid=89=l [6c ]a score=-3,19285, c=-0,19125, perm=2, hash=25eff9 prev:label=87, uid=89=l [6c ]a score=-3,0016, c=-0,194395, perm=2, hash=25eff9

...(from RecodeBeamSearch::DebugPath) which looks fine. But it's not what we see above as segmentation, and that derives from Tesseract::SearchWords.
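For readers unfamiliar with CTC output, the following sketch shows why the per-timestep path above "looks fine". It applies the textbook CTC collapse rule (merge runs of identical labels, then drop nulls) to the 16 labels from the DebugPath listing. This is not Tesseract's code; `ctc_collapse` and `label_runs` are hypothetical helpers written only for this illustration.

```python
from itertools import groupby

NULL = None  # stands in for null_char in the log above

# Labels for timesteps 0..15, transcribed from the DebugPath listing.
FRAMES = [NULL, NULL, "C", "C", "C", NULL,
          "e", "e", "e", "l", "l", "l", NULL, "l", "l", "l"]


def ctc_collapse(frame_labels):
    """Merge consecutive identical labels, then remove null frames."""
    return [label for label, _ in groupby(frame_labels) if label is not NULL]


def label_runs(frame_labels):
    """Return (label, start, end) for each non-null run, end exclusive."""
    runs = []
    start = 0
    for label, group in groupby(frame_labels):
        length = len(list(group))
        if label is not NULL:
            runs.append((label, start, start + length))
        start += length
    return runs


print("".join(ctc_collapse(FRAMES)))
print(label_runs(FRAMES))
```

The collapsed text is "Cell", and the two 'l' characters stay separate only because of the null at timestep 12. The runs do not overlap in time, so the overlapping pixel boxes in the HOCR output cannot come from this path alone.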

@stweil, do you think this could be related to your and Noah's fixes in #2576?

woodjohndavid commented 4 years ago

Thanks Bertsky for confirming the issue.

As a Tesseract newbie, could I impose upon you yet again to give me some idea of when and how bugs are prioritized and potentially worked on? Is there any Tesseract development activity actually underway at this point?

I understand fully that Tesseract is open source, and hence I have no basis for any expectations whatsoever. But I would like to understand what the current state of development activity is.

I doubt that I have the necessary technical skills to contribute to Tesseract development, but would be interested to know how one gets involved in that if one chooses.

Thanks in advance for whatever light you can shed on this for me.

bertsky commented 4 years ago

@woodjohndavid I can only give you my personal impression on the questions you just raised. This is obviously a diverse and open community; the perspectives and circumstances of contributors and developers vary substantially.

What gets done how soon depends on many things, notably:

For current development efforts, cf. https://github.com/tesseract-ocr/tesseract/wiki/Planning.

If you want to contribute yourself,

woodjohndavid commented 4 years ago

OK thanks @bertsky, much appreciated. I realize also that this is not the right forum for these kinds of learning questions, but I have had little luck in getting anyone else to respond to them. So just one more, if you would be so kind: is there a leader or manager of the code base responsible for some kind of vetting of contributions before they enter the main code branch? If so, who?

Thanks again.

bertsky commented 4 years ago

There are people here with write permissions, but the reviewing work itself is usually shared. You can find out more by looking at the closed PRs or the contributor list.

woodjohndavid commented 4 years ago

Is there any likelihood that the issue of inaccurate character level bounding box dimensions will be addressed sometime soon? Of course, the real underlying issue is that the Tesseract LSTM engine is including multiple alternative characters in the output stream. However, it seems likely that the latter issue would be harder to correct. If the character level box dimensions could be made accurate, then the workaround that I proposed earlier in this thread for the duplicate character issue would in fact work.

RicketyRick commented 4 years ago

I second this. The wrong character level boxes prevent us and our partner companies from using Tesseract, so we have to subscribe to the bad and expensive APIs of Abbyy and OmniPage. I would rather use Tesseract.

stweil commented 4 years ago

@woodjohndavid, @RicketyRick, the development process is currently entirely community driven. Code changes are provided by volunteers who might have other priorities than you.

So it is up to you to find and suggest a solution by providing a pull request - unless someone else does it.

RicketyRick commented 4 years ago

@stweil thank you, I will try, but the codebase is really big. Is there any help to find a short cut to the sources that might be of interest concerning the bounding box issue?

woodjohndavid commented 4 years ago

There are numerous overlapping issues that have been raised related to this same subject. In perusing a few of them, the names that come up frequently include @Sintun @theraysmith @jbreiden @stweil @noahmetzger who seem to be knowledgeable in this area of functionality and code. Perhaps those gentlemen could give some direction on where to look in the code.

This seems to be directly related to https://github.com/tesseract-ocr/tesseract/pull/2576

ghost commented 4 years ago

Hi,

I believe I have the same issue with the following style of input tables : tessinput-columns

Tesseract gives good enough results with -psm 6 (except it doesn't skip the divider bars of the table, so I have to clean up with sed to delete all [ | \ { and the such that it adds in the middle of the data...). Surprisingly, if I run tesseract after first cleaning up the image to remove the table separators, the results are not as good, and tesseract mixes up Os and 0s, which it doesn't do if I leave the vertical bars. In all cases though, I randomly get double characters (O0 or OQ etc.) when tesseract isn't sure which it is. If I run with hocr, all the random characters are associated with very low wconf.

While waiting for a fix, is there any way to teach tesseract the structure of each line ? All lines are the same and columns can only contain one type of data, digits, or characters...

Thank you very much and have a good day

bertsky commented 4 years ago

@clavelc, most of what you say is not related – please help keeping issues to the point!

> I have to clean up with sed to delete all [ | \ { and the such that it adds in the middle of the data...)

You can do that more easily with a parameter: tesseract -c tessedit_char_blacklist="[|\\{" (or SetVariable() in the API).

> Surprisingly, if I run tesseract after first cleaning up the image to remove the table separators, the results are not as good

Yes, your columns are very close to each other, so the lines should help.

> While waiting for a fix, is there any way to teach tesseract the structure of each line ? All lines are the same and columns can only contain one type of data, digits, or characters...

Yes you can: for this kind of table, you can easily use the --user-patterns feature of the CLI (see man-page and wiki). For more complicated cases, you can always do segmentation separately, then crop segment images, and run in PSM_SINGLE_LINE with very strict user patterns or char whitelist.
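As a rough sketch of how the two suggestions above could be combined when scripting the CLI, the helper below only assembles the argument list; it does not run Tesseract. `build_tesseract_command` and the file names are hypothetical. The options `--psm`, `--user-patterns` and `-c tessedit_char_blacklist=...` are the ones named in this thread, but whether a given Tesseract version and engine mode honours them is not verified here.

```python
import shlex
from typing import List, Optional


def build_tesseract_command(image: str,
                            output_base: str = "stdout",
                            psm: int = 6,
                            char_blacklist: Optional[str] = None,
                            user_patterns: Optional[str] = None) -> List[str]:
    """Assemble a tesseract argument list (hypothetical helper, not part of Tesseract)."""
    cmd = ["tesseract", image, output_base, "--psm", str(psm)]
    if user_patterns:
        # Path to a user patterns file; see the man page for the pattern syntax.
        cmd += ["--user-patterns", user_patterns]
    if char_blacklist:
        # Passed as a list element, so no shell escaping is needed here.
        cmd += ["-c", f"tessedit_char_blacklist={char_blacklist}"]
    return cmd


# 'table.png' and 'columns.patterns' are placeholder file names.
cmd = build_tesseract_command("table.png",
                              char_blacklist=r"[|\{",
                              user_patterns="columns.patterns")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True) with tesseract installed.
```

Building the list separately from running it makes it easy to log the exact command line and to try different page segmentation modes per cropped segment.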

ghost commented 4 years ago

Thanks for your answer,

> please help keeping issues to the point!

Sorry for that, will do !

> you can easily use the --user-patterns feature of the CLI

Thanks for the tip, I looked up --user-patterns the other day but couldn't figure out how to apply it to my table. I'll try again.

Have a good day

woodjohndavid commented 3 years ago

I have downloaded the latest master code branch version and am experimenting with the code under Ubuntu on two fronts:

  1. The original purpose of this thread, which is the inclusion of multiple characters in the output feed for what is essentially the same character position in the incoming image. Interestingly enough, the current version from master is somewhat improved in this regard, as some samples of this problem from earlier on using Tesseract Windows version tesseract-ocr-w64-setup-v5.0.0-alpha.20191030 seem to be working now. However, I have searched the pull request log and see nothing in there that would seem to be related to correcting this issue, and there are still some of my test cases which demonstrate the problem. See the attachment for one small sample of same for which the latest master Tesseract comes up with '10of3'. I am intending to work with the fix suggested in issue #3144 to see if that may be a path forward. I could be wrong, but progress seems to have stalled at this point on that thread.

  2. The secondary purpose which this thread morphed into, which is the issue with the inaccuracy of the character level box dimensions in the HOCR output when using the LSTM engine. I had attempted a workaround for the multiple character problem by using those box dimensions to identify characters with more-or-less the same image position, and to try and select the one with the highest confidence level. However, this workaround turns out to be a non-starter since the box dimensions cannot be relied upon. I have done some investigation on this subject, and will create a separate issue dedicated to that problem. However, this is largely a red herring in my situation, since the only reason I cared about it was if it could be used to solve the multiple overlapping character problem.

OneOfThree

woodjohndavid commented 6 months ago

I have just created pull request https://github.com/tesseract-ocr/tesseract/pull/4211 which I consider to be an improved solution for diplopia.

I encourage everyone on this thread to try this out and test it with as broad a range of cases as possible.

Note by the way, there are some new configuration values that can only be set in code as things stand. These configuration values are:

bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect.
int kMaxDiplopiaGap - maximum number of timesteps apart to be considered diplopia, default 2.

Obviously if my diplopia change is of value, then these configuration items should be made into settings.