Extended hocr - Githubissues

JKamlah commented 6 years ago

ocropy.json and extended hocr

These changes do not break any existing functionality; they add two new features: 1) json output for each line, and 2) hocr output with word boxes and probabilities. The added functionality could (!) replace some of the older code, but it does not, so some calculations are done twice.

What will the addition do?

The new code produces a *.ocropy.json file for each line, which contains:

fpath
id
scale
padding
bboxes (line, word, char)
prob (word, char)
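
As a rough illustration of what one such per-line record might look like: the top-level keys below are taken from the list above, but the nesting, the value types, and all numbers and paths are assumptions made for this sketch, not the format the code actually writes.

```python
import json

# Hypothetical per-line record. Only the key names (fpath, id, scale,
# padding, bboxes, prob) come from the description; everything else
# (nesting, types, values, the example path) is assumed for illustration.
line_record = {
    "fpath": "book/0001/010001.bin.png",
    "id": "010001",
    "scale": 1.0,
    "padding": 3,
    "bboxes": {
        "line": [10, 20, 410, 60],                       # x0, y0, x1, y1
        "word": [[10, 20, 120, 60], [130, 20, 410, 60]],
        "char": [],                                      # left empty here
    },
    "prob": {
        "word": [0.97, 0.88],
        "char": [],
    },
}

# Serialize the record and read it back, as a later step would have to do.
text = json.dumps(line_record, indent=2, sort_keys=True)
restored = json.loads(text)
```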

This information is used to produce an extended hocr file with:

word/char probabilities
word bboxes, e.g. (new hocr file of testpage visualized with hocrjs)
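
To make the idea of a word box with a probability concrete, here is a minimal sketch of how a word element could be written. The `ocrx_word` class and the `bbox` and `x_wconf` properties come from the hOCR specification; the helper name `hocr_word`, the id scheme, and the scaling of a 0..1 probability to a 0..100 confidence are assumptions, and the output of this pull request may differ.

```python
from html import escape

def hocr_word(word_id, text, bbox, prob):
    """Build one hOCR word span (hypothetical helper, not part of ocropy)."""
    x0, y0, x1, y1 = bbox
    # Assumption: the probability is in 0..1 and is written as a percentage.
    conf = int(round(prob * 100))
    title = "bbox %d %d %d %d; x_wconf %d" % (x0, y0, x1, y1, conf)
    # Escape the recognized text so characters like & or < stay valid HTML.
    return "<span class='ocrx_word' id='%s' title='%s'>%s</span>" % (
        word_id, title, escape(text))

span = hocr_word("word_1_1", "Haus", (10, 20, 120, 60), 0.97)
```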

How can it be started?

Some of the scripts take new arguments:

./ocropus-gpageseg 'book/????.bin.png' -j or --json

If ocropus-gpageseg is started with -j/--json, it produces the first part of the *.ocropy.json file.

The following steps (ocropus-rpred, ocropus-hocr) detect an existing *.ocropy.json file and work with it automatically. However, each of these steps can be suppressed individually with an additional argument:

./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png' --nojson

This stops further information from being added to the json file. Note that if this step is skipped, the extended hocr file cannot be created. And

./ocropus-hocr 'book/????.bin.png' -o ersch.html -n or --normal

will create the hocr file the old way anyway (without probabilities or word boxes).

Finally, ocropus-hocr has another parameter, -c/--charconfs, which outputs the confidence of every character. Since this massively increases the amount of data, it is off by default. To use this feature:

./ocropus-hocr 'book/????.bin.png' -o ersch.html -c or --charconfs

Have fun and a Merry Christmas :christmas_tree: :)

JKamlah commented 6 years ago

Big thanks @zuphilip for all the support and @mittagessen (https://github.com/mittagessen/kraken) for the inspiring work.

mittagessen commented 6 years ago

Just a small note: You might want to split words at Unicode whitespace characters with something like regex.split('\s+') as there's more than ASCII space out there.

Otherwise it looks fine as the translate_back function has the needed adjustments to not output empty classes, so it shouldn't produce any weirdly offset bounding boxes.
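
The whitespace point can be illustrated with a short sketch. It uses Python 3 and the standard-library `re` module (an assumption; the comment may refer to the third-party `regex` package, and under Python 2 the `re.UNICODE` flag is required for `\s` to match non-ASCII whitespace). The sample string is made up.

```python
import re

# Sample line with a no-break space (U+00A0), an em space (U+2003),
# and a run of two ASCII spaces between words.
line = "Erste\u00a0Zeile\u2003mit  Text"

# Splitting on the ASCII space only misses the Unicode spaces and
# yields an empty token for the doubled space.
ascii_split = line.split(" ")

# Splitting on runs of any Unicode whitespace gives the four words.
unicode_split = re.split(r"\s+", line.strip(), flags=re.UNICODE)
```

With the ASCII-only split, the first three words stay glued together as one token, which would also merge their bounding boxes into one.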

ocropus-archive / DUP-ocropy

Extended hocr #283
