ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Extended hocr #283

Open JKamlah opened 6 years ago

JKamlah commented 6 years ago

ocropy.json and extended hocr

These changes do not break any existing functionality; they add two new features: 1) JSON output for each line, 2) hOCR output with word boxes and probabilities. Although the added functionality could (!) replace some of the older code, it does not, so some calculations are done twice.

What will the addition do?

The new code produces a *.ocropy.json file for each line, which contains:

This information is used to produce an extended hOCR file with:

How can it be started?

The scripts accept some new arguments:

./ocropus-gpageseg 'book/????.bin.png' -j or --json 

If ocropus-gpageseg is started with -j/--json, it produces the first part of the *.ocropy.json file.

The following steps (ocropus-rpred, ocropus-hocr) recognize that a *.ocropy.json file exists and automatically work with it. However, each of these steps can also be told individually not to use it, with an additional argument:

./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png' --nojson 

This stops ocropus-rpred from adding further information to the JSON file. Note that if this step is skipped, the extended hOCR file cannot be created. And

./ocropus-hocr 'book/????.bin.png' -o ersch.html -n or --normal

creates the hOCR file the old way regardless (without probabilities or word boxes).

Finally, ocropus-hocr has another parameter, -c/--charconfs, which outputs the confidence of every character. Since this massively increases the amount of data, it is off by default. To use this feature:
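To illustrate the automatic detection described above, here is a minimal sketch of how a later step might decide whether to use the JSON file. The helper names `json_path_for` and `use_json` are hypothetical, and the naming rule (strip everything after the first dot of the line image name, then append `.ocropy.json`) is an assumption, not necessarily what this PR does.

```python
import os


def json_path_for(line_image):
    """Guess the JSON path that belongs to a line image (assumed naming)."""
    directory, name = os.path.split(line_image)
    # Assumption: "010001.bin.png" -> "010001" -> "010001.ocropy.json"
    base = name.split(".", 1)[0]
    return os.path.join(directory, base + ".ocropy.json")


def use_json(line_image, nojson=False):
    """Use the JSON file only if it exists and was not suppressed."""
    return (not nojson) and os.path.exists(json_path_for(line_image))
```

With this logic, a run without -j/--json in ocropus-gpageseg leaves no JSON file behind, so the later steps silently fall back to the old behaviour.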

./ocropus-hocr 'book/????.bin.png' -o ersch.html -c or --charconfs
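For readers unfamiliar with hOCR, the sketch below shows roughly what a word element with a bounding box and confidences could look like. It uses the `ocrx_word`, `x_wconf` and `x_confs` names from the hOCR specification; the function `hocr_word` is hypothetical, and it assumes probabilities in the range 0 to 1 that are scaled to 0 to 100. The exact markup this PR emits may differ.

```python
from html import escape


def hocr_word(text, bbox, wconf, charconfs=None):
    """Format one word as an hOCR span (illustrative only)."""
    props = [
        "bbox %d %d %d %d" % tuple(bbox),
        # Assumption: probabilities are in [0, 1]; hOCR uses 0..100
        "x_wconf %d" % round(wconf * 100),
    ]
    if charconfs is not None:
        # Per-character confidences, only when explicitly requested
        confs = " ".join("%d" % round(c * 100) for c in charconfs)
        props.append("x_confs " + confs)
    title = "; ".join(props)
    return "<span class='ocrx_word' title='%s'>%s</span>" % (title, escape(text))
```

Leaving `charconfs` out by default mirrors the reasoning above: one number per character adds up quickly over a whole book.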

Have fun and a Merry Christmas :christmas_tree: :)

JKamlah commented 6 years ago

Big thanks @zuphilip for all the support and @mittagessen (https://github.com/mittagessen/kraken) for the inspiring work.

mittagessen commented 6 years ago

Just a small note: You might want to split words at Unicode whitespace characters with something like regex.split('\s+') as there's more than ASCII space out there.

Otherwise it looks fine as the translate_back function has the needed adjustments to not output empty classes, so it shouldn't produce any weirdly offset bounding boxes.
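A minimal sketch of the whitespace suggestion above, using the standard library `re` module rather than the third-party `regex` package (an assumption on my side; either should work for this purpose). The helper name `split_words` is hypothetical.

```python
import re

# With re.UNICODE, \s also matches non-ASCII whitespace such as
# U+00A0 (no-break space) and U+2003 (em space). In Python 3 this is
# already the default for str patterns; the flag is spelled out for clarity.
_WHITESPACE = re.compile(r"\s+", re.UNICODE)


def split_words(line):
    """Split a recognized text line into words at any Unicode whitespace."""
    # Drop the empty strings produced by leading or trailing whitespace
    return [word for word in _WHITESPACE.split(line) if word]


# Splitting on ASCII space alone would miss the no-break space here
print(split_words(u"ein\u00a0Wort\u2003mehr  text "))
```

Filtering out empty strings also avoids creating word boxes for zero-length words at the start or end of a line.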