tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: Lattice Output #2339

Open bertsky opened 5 years ago

bertsky commented 5 years ago

I would like to propose an extension to the API (and CLI) for producing a character lattice (directed acyclic character hypotheses graph) as recognition result with LSTM models.

Rationale

For some applications, such as LM rescoring, OCR post-correction, multi-OCR alignment, keyword spotting, FST decoding etc., it is not enough to produce only the best path from CTC decoding, or even the n best paths of the beam, or a sequence of alternatives (aka confusion network or sausage lattice): they all throw away information on the segmentation ambiguity, which is important to preserve for the aforementioned tasks. Representing the OCR search space as a character lattice is versatile, and it spares the user the need to understand CTC.

This is especially important in Tesseract, because its LSTM decoder implements a specially optimised form of CTC that is not easy to reproduce externally:

Exporting the full character lattice would encapsulate those intricacies and dependencies, while still allowing downstream applications to integrate tightly with the OCR.

Implementation

In principle, inside src/lstm/recodebeam.cpp, it is not difficult to transform the resulting beam search trie at the last horizontal position into a lattice (with converging edges): one can traverse each node backwards along its path, generating vertices (for positions) and edges (for characters), stopping at previously seen prefix paths (merely adding fan-out) and joining previously generated suffix paths (adding fan-in). The data structure of the beam elements, RecodeNode, with its linked list (prev) and prefix hash (code_hash), makes this straightforward. One can also easily re-use the Trie class as the lattice data type (perhaps as a new DawgType=DAWG_TYPE_HYPOTHESES) by extending it with public members for inserting nodes and edges. Besides the UNICHAR_ID characters themselves, the edges have to carry the confidence scores, and the nodes have to carry coordinates. Alternatively, one could write a simple STL-based representation from scratch. (One could probably even employ an FST representation, although it would be harder to preserve the coordinates there.)
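To make the idea concrete, here is a minimal, self-contained sketch of that backward traversal. The structs are hypothetical simplifications, not Tesseract's actual RecodeNode API: walking each surviving beam entry backwards and inserting its edges into a set makes shared prefix and suffix edges collapse automatically, turning the beam trie into a DAG with converging edges.

```cpp
#include <set>
#include <tuple>
#include <vector>

// Simplified stand-in for Tesseract's beam elements (hypothetical):
// each hypothesis is a backward-linked path of (position, character,
// score) nodes, analogous to RecodeNode's prev linkage.
struct HypNode {
  int t;                // horizontal position (lattice vertex id)
  char ch;              // character emitted when entering this node
  float score;          // per-character confidence
  const HypNode* prev;  // previous node on this hypothesis path
};

struct Edge {
  int from, to;
  char ch;
  float score;
  bool operator<(const Edge& o) const {
    return std::tie(from, to, ch) < std::tie(o.from, o.to, o.ch);
  }
};

// Walk each surviving beam entry backwards, inserting its edges into
// a set; edges shared between hypotheses (common prefixes/suffixes)
// are deduplicated, which yields the converging lattice structure.
std::set<Edge> BuildLattice(const std::vector<const HypNode*>& beam) {
  std::set<Edge> lattice;
  for (const HypNode* n : beam) {
    while (n != nullptr && n->prev != nullptr) {
      lattice.insert({n->prev->t, n->t, n->ch, n->score});
      n = n->prev;
    }
  }
  return lattice;
}
```

Two hypotheses sharing the prefix "a" and diverging into "b" and "c" would thus produce three edges rather than four, with fan-out at the shared vertex.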

I have already written a proof of concept for this, which generates a lattice based on Trie (though still without confidences and coordinates) and can also output a visualization via Graphviz. I can share it if anyone is interested. This is what the results typically look like:

[line image alexis_ruhe01_1852_0035_019 and its Graphviz dot lattice rendering]

Problems

Obviously, the lattices created this way are still too narrow, because the beam itself is too narrow. So extracting the lattice after beam search may already be too late: one could try to join hypotheses falling off the beam with neighbouring nodes (at little extra cost, but this would require fusing the APIs for recognition and extraction).

And there is yet another, even more severe problem: vanilla CTC cannot reliably distinguish true repetitions of a character in the input from fake duplicates that result from alignment paths in the presence of competing null output. For example, node 1 above at t=311 comes from paths with two successive hyphens (with a null in between) instead of one. This "diplopia" phenomenon occurs rather rarely when only the best path is extracted – but it is not unknown, especially with suboptimal models. However, it can be expected to become overwhelming when extracting deeper alternatives over a wider beam. I fail to see any simple measure to counter that (besides limited heuristics like suppressing single-pixel output hypotheses). One could adopt enhanced approaches like Equal Spacing CTC, which might be superior anyway, but that would be quite an undertaking.
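To illustrate the diplopia issue, here is a tiny sketch of the vanilla CTC collapse rule (using '_' as the null/blank symbol; this is the generic textbook rule, not Tesseract's optimised decoder): two frame-level paths that differ only in one blank decode to different surface strings, so a wide beam will inevitably mix true double characters with alignment artefacts.

```cpp
#include <string>

// Vanilla CTC collapse: merge adjacent repeated frame labels, then
// drop the blank symbol ('_' here). Generic illustration only.
std::string CtcCollapse(const std::string& frames) {
  std::string out;
  char last = '\0';
  for (char c : frames) {
    if (c != last && c != '_') out.push_back(c);
    last = c;
  }
  return out;
}
```

For example, the frame paths "---" and "-_-" collapse to "-" and "--" respectively: a single null frame in the middle turns one hyphen into two, which is exactly the ambiguity that surfaces once deeper alternatives are extracted from the beam.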

Conclusion

Should Tesseract provide such a feature? What is the right approach for lattice extraction? How can the diplopia problem be addressed? What API / which data structures would be preferable (without introducing too many dependencies but being maximally reusable, perhaps even in Python)?

BTW, there were some efforts in that direction in ocropy as well.

@noahmetzger @stweil @kba Correlates with this PAGE-XML extension proposal

adelra commented 5 years ago

This is a great idea, and it would be useful for debugging and for increasing the accuracy of the OCR. Also, using existing lattice formats like Kaldi's would help a lot, because there are many existing tools and scripts for rescoring and applying LMs to them.

amitdo commented 4 years ago

BTW, there were some efforts in that direction in ocropy as well.

Are you referring to https://github.com/tmbarchive/ocropy/pull/25 or https://github.com/tmbarchive/ocropy/pull/279 ?

bertsky commented 4 years ago

@amitdo I linked to both of them above. But 25 is not as general, and 279 is the undecoded/full matrix; therefore I also linked 186, which refers to OLD/ocropus-lattices (which in turn merely uses linerec.write_lattice() from OLD/linerec).

amitdo commented 4 years ago

Oops, I didn't notice the links in your first post.