bertsky opened 5 years ago
This is a great idea, and it would be useful for debugging and for increasing the accuracy of the OCR. Also, using existing lattice formats like Kaldi's would help a lot, because there are many existing tools and scripts for rescoring them and applying LMs to them.
BTW, there were some efforts in that direction in ocropy as well.
Are you referring to https://github.com/tmbarchive/ocropy/pull/25 or https://github.com/tmbarchive/ocropy/pull/279 ?
@amitdo I linked to both of them above. But #25 is not as general, and #279 is the undecoded/full matrix; therefore I also linked #186, which refers to OLD/ocropus-lattices (which in turn merely uses `linerec.write_lattice()` from OLD/linerec).
Oops, I didn't notice the links in your first post.
I would like to propose an extension to the API (and CLI) for producing a character lattice (directed acyclic character hypotheses graph) as recognition result with LSTM models.
Rationale
For some applications, such as LM rescoring, OCR post-correction, multi-OCR alignment, keyword spotting, FST decoding etc., it is not enough to produce only the best path from CTC decoding, or even the n best paths of the beam, or a sequence of alternatives (aka confusion network or sausage lattice): they all throw away information on the segmentation ambiguity, which is important to preserve for the aforementioned tasks. Representing the OCR search space as a character lattice is versatile, and it relieves the user of needing any knowledge of CTC.
This is especially important in Tesseract, because its LSTM decoder contains a specially optimised form of CTC that is not easy to work around:

- beam search with continuation states (`NodeContinuation` etc.)
- recoded/compressed character classes (`UnicharCompress`, `RecodedCharID` etc.)

Export of the full character lattice would encapsulate those intricacies and dependencies, while still allowing downstream applications to integrate tightly with the OCR.
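To see why the recoded classes get in the way of naive lattice consumers, consider a toy version of output recoding (in the spirit of `UnicharCompress`/`RecodedCharID`, but not the real API — `Decode`, `Code`, and the table are hypothetical, and prefix-free code sequences are assumed): the network emits short code sequences from a small alphabet, which must be recoded back into characters before any downstream tool can use them.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy illustration only: each character is emitted as a short sequence
// of codes from a small alphabet, so the softmax stays narrow even for
// large charsets. Assumes the code sequences are prefix-free.
using Code = int;

std::string Decode(const std::vector<Code>& codes,
                   const std::map<std::vector<Code>, char>& table) {
  std::string out;
  std::vector<Code> pending;  // codes accumulated since last character
  for (Code c : codes) {
    pending.push_back(c);
    auto it = table.find(pending);
    if (it != table.end()) {  // complete code sequence -> one character
      out.push_back(it->second);
      pending.clear();
    }
  }
  return out;
}
```

A character-lattice export would hide exactly this kind of recoding step from downstream applications.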
Implementation
In principle, inside src/lstm/recodebeam.cpp, it is not difficult to transform the resulting beam search trie at the last horizontal position into a lattice (with converging edges): one can traverse each node backwards along its path, generating vertices (for positions) and edges (for characters), but stopping at previously seen prefix paths (merely adding fan-out), and joining previously generated suffix paths (adding fan-in). The data structure of the beam elements, `RecodeNode`, with its linked list (`prev`) and prefix hash (`code_hash`), makes this straightforward. Also, one can easily re-use the `Trie` class as lattice data type (perhaps as a new `DawgType=DAWG_TYPE_HYPOTHESES`) by extending it with public members for insertion of nodes and edges. Besides the `UNICHAR_ID` characters themselves, the edges have to carry the confidence scores, and the nodes have to carry coordinates. Alternatively, one could write some simple STL-based representation from scratch. (One could probably even employ an FST representation, although it would be more difficult to preserve the coordinates there.)

I already wrote a proof of concept for this, which generates a lattice based on `Trie` (though still without confidence and coords), and can also output a visualization via Graphviz. I can share it, if anyone is interested. This is what the results typically look like:

Problems
Obviously, the lattices thus created are still too narrow, because the beam itself is too narrow. So perhaps lattice extraction after beam search is already too late: One could try to join hypotheses falling off the beam with neighbouring nodes (without much extra cost, but requiring to fuse APIs for recognition and extraction).
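For concreteness, the backward traversal described under Implementation could look roughly like this self-contained toy (`BeamNode` is a hypothetical stand-in for `RecodeNode`; real code would also carry confidence scores and coordinates):

```cpp
#include <cassert>
#include <set>
#include <tuple>
#include <vector>

// Toy stand-in for Tesseract's RecodeNode: a beam hypothesis that
// points back to its predecessor along the path.
struct BeamNode {
  char unichar;          // character emitted at this step
  int t;                 // horizontal position (timestep)
  const BeamNode* prev;  // back-pointer (the linked list)
};

// Lattice edge: (from-vertex, to-vertex, character). Vertices are
// timestep positions, so identical edges from different hypotheses
// merge automatically (fan-in).
using Edge = std::tuple<int, int, char>;

// Walk each surviving beam entry backwards; stop when we reach a node
// we have already expanded (shared prefix -> fan-out in the lattice).
std::set<Edge> BeamToLattice(const std::vector<const BeamNode*>& beam) {
  std::set<Edge> edges;
  std::set<const BeamNode*> seen;
  for (const BeamNode* n : beam) {
    while (n != nullptr && !seen.count(n)) {
      seen.insert(n);
      int from = n->prev ? n->prev->t : 0;
      edges.insert(Edge{from, n->t, n->unichar});
      n = n->prev;
    }
  }
  return edges;
}
```

Two hypotheses "ab" and "ac" sharing the prefix "a" would yield three edges: one for the shared "a" and two diverging edges at the next position.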
And there is yet another, even more severe problem: vanilla CTC cannot reliably discern between true repetitions of a character in the input and fake duplicates which result from alignment paths in the presence of competing null output. For example, node 1 above at `t=311` comes from paths with two successive hyphens (with null in between) instead of one. This "diplopia" phenomenon occurs rather rarely when only the best path is to be extracted – but is not unknown, especially with suboptimal models. However, it can be expected to become overwhelming when extracting deeper alternatives and looking at a wider beam. I fail to see any simple measures to counter that (besides limited heuristics like suppressing single-pixel output hypotheses). One could adopt enhanced approaches like Equal Spacing CTC, which might be superior anyway. But this would be quite an undertaking.

Conclusion
Should Tesseract provide such a feature? What is the right approach for lattice extraction? How can the diplopia problem be addressed? What API / which data structures would be preferable (without introducing too many dependencies but being maximally reusable, perhaps even in Python)?
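As an aside, the diplopia ambiguity discussed under Problems can be reproduced with a minimal greedy CTC collapse function (a toy, not Tesseract's decoder; `~` stands for the null/blank label):

```cpp
#include <cassert>
#include <string>

// Standard CTC path collapse: repeated labels merge unless separated
// by a blank. This is the rule that makes duplicates ambiguous: a
// brief blank spike between otherwise identical frames doubles the
// output character.
std::string CtcCollapse(const std::string& path, char blank = '~') {
  std::string out;
  char prev = blank;
  for (char c : path) {
    if (c != blank && c != prev) out.push_back(c);
    prev = c;
  }
  return out;
}
```

Two nearly identical frame sequences, `----` and `--~--`, collapse to `-` and `--` respectively, so hypotheses deep in the beam can differ only in such spurious blank insertions.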
@noahmetzger @stweil @kba Correlates with this PAGE-XML extension proposal