mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Training with bounding box data (png/gt.txt pair) #516

Closed rohanchn closed 1 year ago

rohanchn commented 1 year ago

Hi @mittagessen!

I am trying to train a recognizer with bounding box data in 4.3.12.

My ketos train command uses --format-type path.

The image/gt.txt pairs were prepared with ketos extract from the corrected output of ketos transcribe. The training works but it is terribly slow. I am not sure if I can benefit from ketos compile as I don't have the corresponding .path files for images.

Is there something I can do to improve training speed?

mittagessen commented 1 year ago

Ah this is awkward. The help message is wrong. You can just compile bbox style files like normal:

ketos compile -o bbox.arrow -f path */*.png

should work.

Apart from that, training on bbox data should be quite a bit faster than on binary data. Do you by chance use grayscale images or something weird like that?

rohanchn commented 1 year ago

I think I have bi-level images. I compiled a binary dataset with this command, but still no luck with the speed. The accuracy is actually not bad; it's just that training is quite slow.

colibrisson commented 1 year ago

Still no luck with the speed.

Are you referring to training speed or compilation speed? When compiling a dataset with multiple processes, scikit-learn will often spawn as many threads in each process as there are cores available on the system, which makes compilation very slow. You can prevent this behavior with environment variables, as explained in the scikit-learn documentation.

Regarding bi-level images, I have also noticed very poor performance both in training and inference.

rohanchn commented 1 year ago

@colibrisson, I am referring to training speed. ketos compile for the png/gt.txt pairs was quite a bit faster than what I usually do, i.e. --format-type alto.

You can prevent this behavior by using environment variables as explained in [scikit-learn documentation](https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp).

Isn't this something I can also control by passing the number of threads I want to use to --workers?

Yes, I think inference with a model trained on bi-level images somewhat deteriorates when the model is used on RGB images.

colibrisson commented 1 year ago

Isn't this something I can also control by passing the number of threads I want to use to --workers?

The --workers argument allows you to choose the number of parallel processes used during compilation. The problem is that within each of these processes, scikit-learn will spawn many threads, sometimes as many as the number of cores available.
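To avoid the oversubscription described above (worker processes times per-process native threads exceeding the core count), scikit-learn's parallelism documentation suggests capping the OpenMP/BLAS thread pools via environment variables. A minimal sketch, assuming the caps are set before any numerical library loads (in practice you would set them in the shell before launching ketos compile; the value of 1 is illustrative):

```python
import os

# Variables listed in scikit-learn's parallelism docs for capping the
# native thread pools (OpenMP / OpenBLAS / MKL) each worker may spawn.
# They must be set before the numerical libraries are first imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

print(os.environ["OMP_NUM_THREADS"])  # → 1
```

With the caps in place, parallelism comes only from the process count chosen via --workers, rather than being multiplied by hidden per-process thread pools.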

colibrisson commented 1 year ago

In fact, I also ran into this issue once during training when using augmentation. Did you check your system resources?

rohanchn commented 1 year ago

The problem is that within each of these processes, scikit-learn will spawn many threads

Right, this makes sense. I didn't notice anything weird in system resources during training though.

Will look into this. Thank you!