tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

output of LSTM choices only works when model has dicts #3291

Open · bertsky opened this issue 3 years ago

bertsky commented 3 years ago

I wrote a little proof of concept to debug and visualise the LSTM matrix output written by @noahmetzger – it looks like this:

[image: tiled visualisation of the LSTM choice matrix]

The tool builds on the hOCR renderer's ocrx_cinfo output, enabled under lstm_choice_mode=1:

snipped hOCR output of first word with LSTM choices

```html
3269. 3 3 3 3 2 2 2 2 6 6 8 6 8 9 9 9 9 . . . Byron's,
```
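
For orientation, this is roughly how such hOCR can be produced through the C++ API (a minimal sketch, assuming a Tesseract ≥ 4 build with Leptonica; the image path and model name are just the placeholders from the CLI call shown below, not the actual tool):

```cpp
// Minimal sketch: render hOCR with per-timestep LSTM choices
// (lstm_choice_mode=1), mirroring the CLI call shown below.
// Assumes Tesseract >= 4 and Leptonica; paths/model are placeholders.
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "GT4HistOCR", tesseract::OEM_LSTM_ONLY) != 0)
    return 1;
  api.SetVariable("lstm_choice_mode", "1");        // raw timesteps
  api.SetPageSegMode(tesseract::PSM_SINGLE_LINE);  // like --psm 7
  Pix* image = pixRead("image.png");
  api.SetImage(image);
  char* hocr = api.GetHOCRText(0);  // contains the ocrx_cinfo spans
  printf("%s\n", hocr);
  delete[] hocr;
  pixDestroy(&image);
  return 0;
}
```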

Now, I have played around with it a bit, and it struck me that there are no choices or timesteps for many words when I use only models from tesstrain, or more precisely, models without DAWGs/dicts:

```
tesseract image.png image_tess -c lstm_choice_mode=1 -c lstm_choice_iterations=0 --psm 7 --oem 1 -l GT4HistOCR+ONB hocr txt
Failed to load any lstm-specific dictionaries for lang GT4HistOCR!!
Failed to load any lstm-specific dictionaries for lang ONB!!
```
snipped hOCR output of first 3 words, only the 3rd with LSTM choices

```html
3269. Byron's, Lord,
```

Obviously, my tool does not work under these circumstances, and neither does any other use-case for ChoiceIterator (character hypothesis) output.

(And no, it does not work when setting lstm_choice_iterations>0 or lstm_choice_mode=2 either. The latter uses GetBestLSTMSymbolChoices, whereas mode=1 uses GetRawLSTMTimesteps. Both feed from LSTMRecognizer::RecognizeLine's calls to RecodeBeamSearch::extractSymbolChoices.)
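
For reference, the ChoiceIterator path mentioned above looks roughly like this (a minimal sketch; model and image names are the placeholders used earlier, and with a dawg-less model the inner choice loop is exactly what comes up empty):

```cpp
// Sketch of the old ChoiceIterator API for per-symbol alternatives.
// With a model lacking dawgs, the inner loop yields no alternatives,
// which is the behaviour reported in this issue.
#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>
#include <leptonica/allheaders.h>
#include <cstdio>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "GT4HistOCR", tesseract::OEM_LSTM_ONLY) != 0)
    return 1;
  Pix* image = pixRead("image.png");  // placeholder input
  api.SetImage(image);
  api.Recognize(nullptr);
  tesseract::ResultIterator* ri = api.GetIterator();
  if (ri != nullptr) {
    do {
      char* sym = ri->GetUTF8Text(tesseract::RIL_SYMBOL);
      if (sym == nullptr) continue;  // skip empty positions
      printf("symbol %s:\n", sym);
      delete[] sym;
      tesseract::ChoiceIterator ci(*ri);  // alternatives for this symbol
      do {
        printf("  choice %s conf %.2f\n", ci.GetUTF8Text(), ci.Confidence());
      } while (ci.Next());
    } while (ri->Next(tesseract::RIL_SYMBOL));
    delete ri;
  }
  pixDestroy(&image);
  return 0;
}
```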

So the implementation seems broken: it produces no choices for predictions that are not delimited by dawg matches.

Incidentally, in trying to isolate that problem, I tried to use load_number_dawg=0, load_punc_dawg=0 etc. with pretrained LSTM models. But these dicts were still loaded! (I should have generalized that part as well when I allowed passing some of the cmdline params to the LSTM instance in 297d7d86c.)

What should we do now, @stweil?

amitdo commented 3 years ago

Try to create a dummy dawg from a wordlist of just one or two words (one or two lines). Add it to the model and tell us whether this hack solves the issue.

bertsky commented 3 years ago

> Try to create a dummy dawg from a wordlist of just one or two words (one or two lines). Add it to the model and tell us whether this hack solves the issue.

I did:

```
combine_tessdata -e GT4HistOCR.traineddata
wordlist2dawg <(echo Deutſche) test.dawg GT4HistOCR-tessdata.lstm-unicharset
combine_tessdata -o GT4HistOCR.traineddata test.dawg
mv GT4HistOCR.traineddata GT4HistOCR_worddawg.traineddata
```

Now when running with -l GT4HistOCR_worddawg+ONB it does give me:

```
loading lstm-word-dawg
Failed to load any lstm-specific dictionaries for lang ONB!!
```

And then I do indeed see LSTM choices for that word (and a corrected recognition result, Deutſche – it was Oeutſche before). But the others stay empty.

stweil commented 3 years ago

> there are no choices or timesteps for many words

Are there choices for some words, or are they missing for every word?

bertsky commented 3 years ago

> Are there choices for some words, or are they missing for every word?

The former – see the second hOCR snippet posted above.

bertsky commented 3 years ago

I forgot to properly distinguish between the output of GetBestLSTMSymbolChoices (mode=2) / GetRawLSTMTimesteps (mode=1) on the one side and the output of ChoiceIterator (the old API) on the other: only the former shows the above-mentioned behaviour; the latter never yields any non-best output.

So it would seem that the generation of hypotheses is broken because it depends on the dawgs, and that the output path to the old API is completely dysfunctional.

EDIT: Bollocks, I looked the wrong way this time. ChoiceIterator behaves just the same way as GetBestLSTMSymbolChoices.

Sorry for the noise!

wollmers commented 3 years ago

Interesting.

I wonder a little why the empty character appears so often with high confidence:

```html
       <span class='ocr_symbol' id='symbol_1_1_1'>
        <span class='ocrx_cinfo' id='timestep1_1_1'>
         <span class='ocrx_cinfo' id='choice_1_1_1' title='x_confs 99'></span></span><!-- EMPTY -->
        <span class='ocrx_cinfo' id='timestep1_1_2'>
         <span class='ocrx_cinfo' id='choice_1_1_2' title='x_confs 99'></span></span><!-- EMPTY -->
        <span class='ocrx_cinfo' id='timestep1_1_3'>
         <span class='ocrx_cinfo' id='choice_1_1_3' title='x_confs 86'></span><!-- EMPTY -->
         <span class='ocrx_cinfo' id='choice_1_1_4' title='x_confs 13'>3</span></span>
        <span class='ocrx_cinfo' id='timestep1_1_4'>
         <span class='ocrx_cinfo' id='choice_1_1_5' title='x_confs 99'>3</span></span>
        <span class='ocrx_cinfo' id='timestep1_1_5'>
         <span class='ocrx_cinfo' id='choice_1_1_6' title='x_confs 99'>3</span></span>
        <span class='ocrx_cinfo' id='timestep1_1_6'>
         <span class='ocrx_cinfo' id='choice_1_1_7' title='x_confs 72'></span><!-- EMPTY -->
         <span class='ocrx_cinfo' id='choice_1_1_8' title='x_confs 27'>3</span></span>
        <span class='ocrx_cinfo' id='timestep1_1_7'>
         <span class='ocrx_cinfo' id='choice_1_1_9' title='x_confs 99'></span></span></span><!-- EMPTY -->
```

In the case of numbers, it is not easy to check choices in post-processing without character-level bounding boxes.

BTW, I get perfect results with two other, differently trained models:

```
3269. Byron's, Lord, ſaͤmmtl. Werke. Jns Deutſche uͤberſ. v. Mehreren.
3269. Byron's, Lord, ſaͤmmtl. Werke. Ins Deutſche uͤberſ. v. Mehreren.
```

Neither has any lstm-specific dictionaries. The J/I difference is a matter of taste without knowledge of the publication date or the complete font.

AFAIR, DAWG lookups were slow in v3. It's questionable to include wordlists in trained models, especially with the different orthographies of many centuries. This seems to be a problem for e.g. lat.traineddata.

bertsky commented 3 years ago

Welcome to the discussion, @wollmers!

> I wonder a little why the empty character appears so often with high confidence:

What you see is the output of lstm_choice_mode=1 (i.e. timesteps, without CTC). So empty outputs denote (possible) symbol separators ("null char").
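
To illustrate the point (a toy sketch only, not Tesseract's actual beam-search decoder): in CTC-style decoding, runs of identical labels are collapsed and the null char is dropped, so most timesteps legitimately emit the empty label with high confidence.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy CTC-style collapse: merge runs of identical labels, drop the blank
// ("null char"). Only illustrates why empty timesteps dominate the raw
// lstm_choice_mode=1 output; NOT Tesseract's actual decoder.
std::string ctc_collapse(const std::vector<std::string>& steps,
                         const std::string& blank = "") {
  std::string out, prev = blank;
  for (const auto& s : steps) {
    if (s != blank && s != prev) out += s;  // first label of a new run
    prev = s;
  }
  return out;
}

int main() {
  // Roughly the timesteps of the first "3" in the hOCR snippet above.
  std::vector<std::string> steps = {"", "", "3", "3", "3", "", ""};
  std::cout << ctc_collapse(steps) << "\n";  // prints "3"
}
```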

> BTW, I get perfect results with two other, differently trained models. Neither has any lstm-specific dictionaries.

No doubt, but this issue is not about the quality of models (or single-best paths), but about being able to see the alternatives (for various use cases like post-correction, keyword search, grammar decoding etc.).

> AFAIR, DAWG lookups were slow in v3. It's questionable to include wordlists in trained models, especially with the different orthographies of many centuries.

That may well be. I don't want to argue for or against model-embedded DAWGs here. (I can see your point regarding incomplete vocabulary and historic orthographies, but Tesseract's so-called language models are only a hint/rescoring mechanism, not a hard constraint. And they do usually help, at least for the more common known words, numbers and punctuation, in particular if you compare with non-dawg versions of the same model, like frk stripped of its dawgs. I guess it depends on the use-case.)

But again, the issue is why Tesseract does not produce any choice output if there are no DAWGs. That's a bug IMO. (Especially since the use-case would typically be that language models are then used externally.)

> This seems to be a problem for e.g. lat.traineddata.

You mean that it does not know ſ and most ligatures? (That's a question of its unicharset, though.)