mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Print separate lines for pages in log output of extract_lines.py #591

Closed stweil closed 2 months ago

stweil commented 2 months ago

Instead of

Processing PPN1807526488/1807526488_0011.xml .....................................Processing PPN1807526488/1807526488_0005.xml ....Processing PPN1807526488/1807526488_0004.xml ..Processing PPN1807526488/1807526488_0010.xml .................................................Processing PPN1807526488/1807526488_0006.xml Processing PPN1807526488/1807526488_0012.xml ...................................Processing PPN1807526488/1807526488_0013.xml ........................Processing PPN1807526488/1807526488_0007.xml .....

with this change the script produces this log output:

Processing PPN1807526488/1807526488_0011.xml .....................................
Processing PPN1807526488/1807526488_0005.xml ....
Processing PPN1807526488/1807526488_0004.xml ..
[...]
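
For illustration, the per-page log format shown above can be sketched as follows. This is only a minimal sketch, not the actual code of extract_lines.py: `process_pages` and its `(name, line_count)` input are hypothetical stand-ins, and the real script extracts a line image at the point where the sketch only prints a dot.

```python
import sys


def process_pages(pages, out=sys.stdout):
    """Write one log line per page: the file name followed by one dot per text line.

    `pages` is a hypothetical iterable of (file_name, line_count) pairs that
    stands in for the parsed XML files the real script works on.
    """
    for name, line_count in pages:
        out.write(f'Processing {name} ')
        for _ in range(line_count):
            # The real script would extract and save one line image here.
            out.write('.')
            out.flush()  # show progress immediately
        # Terminate the log line so the next page starts on a new line.
        out.write('\n')


if __name__ == '__main__':
    process_pages([('page_a.xml', 4), ('page_b.xml', 2)])
```

The only difference from the single-line output is the newline written after the last dot of each page.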

stweil commented 2 months ago

@mittagessen, thank you for pointing me to this script. It works pretty well and is really fast.

Is it also possible to extract the line images without a black background? I am not sure whether line images which only contain the original image inside of the polygon are good for Tesseract training.

I tried --legacy-polygons, but it looks like that code no longer works (it aborts with an exception).

mittagessen commented 2 months ago

> Is it also possible to extract the line images without a black background? I am not sure whether line images which only contain the original image inside of the polygon are good for Tesseract training.

Hmm, not really. The extract_polygons() function the script calls just masks the background out, and you'd need to change that function to not apply the mask. But I'm not sure how much use the extracted lines are for Tesseract training anyway, as the baseline projection the line extractor does is obviously not available in Tesseract's bbox data model.
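
To make the difference between a masked and an unmasked crop concrete, here is a rough pure-Python sketch. It does not use or reproduce kraken's extract_polygons() and it leaves out the baseline projection entirely; `crop_line`, `point_in_polygon` and the `background` parameter are hypothetical names, the image is assumed to be a list of rows of grayscale values, and the polygon is assumed to have integer vertices that lie inside the image.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def crop_line(image, polygon, background=None):
    """Crop the bounding box of `polygon` from `image`.

    background=None keeps every original pixel in the box (no mask applied);
    any other value replaces the pixels outside the polygon, e.g. 0 for a
    black or 255 for a white background.
    """
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    rows = []
    for y in range(min(ys), max(ys) + 1):
        row = []
        for x in range(min(xs), max(xs) + 1):
            pixel = image[y][x]
            # Test the pixel centre against the polygon.
            if background is not None and not point_in_polygon(x + 0.5, y + 0.5, polygon):
                pixel = background
            row.append(pixel)
        rows.append(row)
    return rows
```

Calling `crop_line(image, polygon)` returns the plain bounding-box crop, while `crop_line(image, polygon, background=0)` blacks out everything outside the polygon.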