rohanchn closed this issue 1 year ago
Ah this is awkward. The help message is wrong. You can just compile bbox style files like normal:
ketos compile -o bbox.arrow -f path */*.png
should work.
Apart from that, training on bbox data should be quite a bit faster than on binary data. Do you by any chance use grayscale images or something weird like that?
I think I have bi-level images. I compiled a binary dataset with this command, but still no luck with the speed. The accuracy is not that bad actually, just the speed is quite slow.
> Still no luck with the speed.
Are you referring to training speed or compilation speed? When compiling a dataset using multiple processes, most of the time scikit-learn will spawn as many threads in each process as there are cores available on the system, which results in very slow compilation. You can prevent this behavior by using environment variables, as explained in the scikit-learn documentation.
Regarding bi-level images, I have also noticed very poor performance both in training and inference.
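For the compilation slowdown mentioned above, here is a minimal sketch of the environment-variable approach. It assumes a POSIX shell and that the thread pools involved honor OpenMP's `OMP_NUM_THREADS`, which is the variable covered in scikit-learn's parallelism documentation. The `--workers 4` value and the `bbox.arrow` output name are only illustrative, and the guard simply skips the call on machines where `ketos` is not installed.

```shell
# Cap each process at one OpenMP thread so that parallel workers
# do not oversubscribe the CPU (assumes the libraries honor OMP_NUM_THREADS).
export OMP_NUM_THREADS=1

# Run the compilation only if ketos is on PATH; the worker count is illustrative.
if command -v ketos >/dev/null 2>&1; then
    ketos compile --workers 4 -o bbox.arrow -f path */*.png
fi
```

Whether this helps depends on the threads actually being the bottleneck, so it is worth comparing wall-clock time with and without the cap.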
@colibrisson, I am referring to training speed. ketos compile for the png/gt.txt pairs was quite a bit faster than what I usually do, i.e. --format-type alto.
> You can prevent this behavior by using environment variables as explained in [scikit-learn documentation](https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp).
Isn't this something I can also control by passing the number of threads I want to use to --workers?
Yes, I think inference with a model trained on bi-level images deteriorates somewhat when the model is used on RGB images.
> Isn't this something I can also control by passing the number of threads I want to use to --workers?
The --workers argument allows you to choose the number of parallel processes used during compilation. The problem is that within each of these processes, scikit-learn will spawn many threads, sometimes as many as the number of cores available.
In fact, I also had the issue one time during training when using augmentation. Did you check your system resources?
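To make the processes-times-threads arithmetic concrete: with 4 workers on an 8-core machine, an uncapped run could end up with roughly 4 × 8 = 32 threads competing for 8 cores. The sketch below splits the cores evenly among the workers instead. `threads_per_worker` is a hypothetical helper written for this illustration, not part of kraken, and the even split is only a starting point under the assumption that the libraries honor `OMP_NUM_THREADS`.

```shell
# Hypothetical helper: divide the available cores among the worker processes.
threads_per_worker() {
    tpw_cores=$1
    tpw_workers=$2
    tpw_result=$(( tpw_cores / tpw_workers ))
    # Never go below one thread per worker.
    if [ "$tpw_result" -lt 1 ]; then
        tpw_result=1
    fi
    echo "$tpw_result"
}

# Illustrative values: 4 workers, core count taken from the running system.
WORKERS=4
CORES=$(getconf _NPROCESSORS_ONLN)
OMP_NUM_THREADS=$(threads_per_worker "$CORES" "$WORKERS")
export OMP_NUM_THREADS
```

After this, the same number would be passed to --workers so that the process count and the per-process thread cap stay in step.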
> The problem is that within each of these processes, scikit-learn will spawn many threads
Right, this makes sense. I didn't notice anything weird in system resources during training though.
Will look into this. Thank you!
Hi @mittagessen!
I am trying to train a recognizer with bounding box data in 4.3.12. My ketos train command has --format-type == path. The image/gt.txt pairs were prepared with ketos extract from the corrected output of ketos transcribe. The training works but it is terribly slow. I am not sure if I can benefit from ketos compile as I don't have the corresponding .path files for images. Is there something I can do to improve training speed?