mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Is it possible to see what file is being processed in segment command? #636

Closed by mirkh 2 months ago

mirkh commented 2 months ago

Hello,

I'm segmenting almost a million images (in batches) with a model we trained. It takes a long time, so it is important to find out whether and where it fails or emits warnings.

At the moment I'm running kraken version 4.2.0.

I use this command to segment the PNG files in a folder, creating ALTO XML files:

kraken -d cuda:0 -I '*.png' -o .xml --alto segment -bl -i model.mlmodel

Is there any option to add to make the output print not just

Segmenting ✓

but also what file it is currently segmenting?

Thanks! / Maria

mittagessen commented 2 months ago

I've pushed a small change to print it by default now. In general I'd suggest updating to the latest version. It should be quite a bit faster for segmentation.

In addition, batch processing like this is much faster when run with a tool like parallel. Segmenting a million pages serially is going to take ages, while with parallelization you can throw whatever resources you have at it (and a failure on a single page doesn't take down everything else). You incur the additional overhead of loading the model in each process, but this is usually acceptable.
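
A minimal sketch of the per-file approach described above, not a tested recipe from the thread. It assumes bash and GNU findutils, and it uses `xargs -P` as a stand-in for parallel. The names `JOBS`, `MODEL`, `KRAKEN` and `segment_one` are made up for illustration. The kraken flags are copied from the question, except that `-I '*.png' -o .xml` is replaced by kraken's explicit `-i INPUT OUTPUT` pair so that each process handles exactly one page. Each page is announced before it is processed, and a failed page is reported on stderr without stopping the rest.

```shell
#!/usr/bin/env bash
# Sketch: one kraken process per image, several running at once.
# JOBS, MODEL, KRAKEN and segment_one are illustrative names, not kraken features.

JOBS="${JOBS:-4}"                # concurrent processes; keep low enough to fit GPU memory
MODEL="${MODEL:-model.mlmodel}"  # model file name taken from the question
KRAKEN="${KRAKEN:-kraken}"       # set KRAKEN=echo for a dry run that only prints the command

segment_one() {
    local img="$1"
    [ -n "$img" ] || return 0    # ignore empty arguments
    local out="${img%.png}.xml"  # page.png -> page.xml
    echo "segmenting: $img"
    if "$KRAKEN" -d cuda:0 -i "$img" "$out" --alto segment -bl -i "$MODEL"; then
        echo "ok: $img"
    else
        echo "FAILED: $img" >&2  # record the failure and carry on with other pages
    fi
}
export -f segment_one
export MODEL KRAKEN

# Feed every PNG in the current directory to segment_one, JOBS at a time.
# With GNU parallel installed, the xargs part could instead be:
#   parallel -0 -j "$JOBS" segment_one
find . -maxdepth 1 -name '*.png' -print0 |
    xargs -0 -r -n 1 -P "$JOBS" bash -c 'segment_one "$1"' _
```

Redirecting stderr to a file (for example `2> failed.log`) leaves a list of pages to retry. Running it once with `KRAKEN=echo` shows the commands that would be executed without loading the model.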

mirkh commented 2 months ago

Thank you very much for working on kraken, for the tips, and for the change you made!