Open JKamlah opened 6 years ago
Big thanks @zuphilip for all the support and @mittagessen (https://github.com/mittagessen/kraken) for the inspiring work.
Just a small note: You might want to split words at Unicode whitespace characters with something like regex.split('\s+')
as there's more than ASCII space out there.
Otherwise it looks fine as the translate_back
function has the needed adjustments to not output empty classes, so it shouldn't produce any weirdly offset bounding boxes.
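To illustrate the whitespace note above, here is a minimal sketch using the standard-library `re` module (not the third-party `regex` package the comment may have meant). The helper name `split_words` is hypothetical and not part of ocropy.

```python
import re

def split_words(text):
    """Split text at runs of Unicode whitespace, dropping empty strings."""
    # re.UNICODE makes \s match non-ASCII whitespace for unicode strings on
    # Python 2; for str patterns on Python 3 this is already the default.
    return [w for w in re.split(r"\s+", text, flags=re.UNICODE) if w]

# A line containing a no-break space (U+00A0) and an em space (U+2003)
line = u"foo\u00a0bar\u2003baz qux"
print(split_words(line))    # splits at all three separators
print(line.split(u" "))     # splitting at ASCII space only misses two of them
```

Splitting at ASCII space alone would merge "foo", "bar" and "baz" into one word here, which would then also get a single, too-wide word box.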
ocropy.json and extended hocr
These changes do not break any existing functions, but add two new features: 1) JSON output for each line, and 2) hocr output with word boxes and probabilities. The added functionality could (!) replace some older code, but it does not, so some calculations are done twice.
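For orientation, a word entry with a box and a probability is usually written in hOCR as an `ocrx_word` span carrying `bbox` and `x_wconf` properties. The sketch below only shows that general shape. The helper `hocr_word`, its arguments and the scaling of the confidence to 0-100 are assumptions for illustration, not the code added by this PR.

```python
from html import escape

def hocr_word(text, bbox, conf):
    """Render one word as an hOCR ocrx_word span (illustrative sketch only).

    bbox is (x0, y0, x1, y1) in pixels; conf is assumed to be in [0, 1].
    """
    x0, y0, x1, y1 = bbox
    # x_wconf is conventionally a word confidence between 0 and 100
    title = "bbox %d %d %d %d; x_wconf %d" % (x0, y0, x1, y1, round(conf * 100))
    return "<span class='ocrx_word' title='%s'>%s</span>" % (title, escape(text))

print(hocr_word("Wort", (10, 20, 110, 60), 0.97))
```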
What will the addition do?
The new code produces a *.ocropy.json file for each line, which contains:
This information is used to produce an extended-hocr file with:
How can it be started?
There are new arguments to functions:
If gpageseg is started with -j/--json, it will produce the first part of the *.ocropy.json. The following steps (ocropus-rpred, ocropus-hocr) will recognize that there is a *.ocropy.json file and will automatically work with it. However, it is also possible to suppress some of these steps individually with an additional argument:
Stops adding further information to the JSON file. Note that if this step is skipped, the extended hocr file can't be created. And
will still create the hocr file the old way (without probabilities or word boxes).
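The hand-over described above, where a later step notices an existing *.ocropy.json and works with it, could be sketched as follows. This is a sketch under stated assumptions: the helper names, and the idea that the JSON file sits next to the line image with the same base name, are hypothetical and may differ from what the PR actually does.

```python
import json
import os

def sidecar_path(line_image):
    """Derive the assumed *.ocropy.json path for a line image path."""
    base = line_image
    # Assumption: line images end in .bin.png or .png
    for ext in (".bin.png", ".png"):
        if base.endswith(ext):
            base = base[:-len(ext)]
            break
    return base + ".ocropy.json"

def load_sidecar(line_image):
    """Return the parsed JSON data, or None if no sidecar file exists."""
    path = sidecar_path(line_image)
    if not os.path.exists(path):
        return None  # caller falls back to the old behaviour
    with open(path) as f:
        return json.load(f)

print(sidecar_path("book/0001/010001.bin.png"))
```

Returning `None` when the file is missing keeps the old code path usable without the new argument.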
Finally, there is another parameter, -c/--charconfs, in ocropus-hocr to output the confidence of every character. Since this increases the amount of data massively, it is not done by default. For usage of this feature:
Have fun and a Merry Christmas :christmas_tree: :)