mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Empty text lines exception while training with binary dataset on 4.2.0 #416

Closed rohanchn closed 1 year ago

rohanchn commented 1 year ago

In 4.2.0, ketos segtrain and ketos train with -f alto are working as expected.

However, I got an empty text lines exception when I tried to train with a binary dataset. There are indeed a few empty text lines in my dataset.
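
For anyone who wants to check their own data, here is a minimal sketch of how empty text lines could be counted in an ALTO file before the binary dataset is compiled. It assumes the transcription is stored in the CONTENT attribute of String elements under each TextLine, as the ALTO schema defines. The function name count_empty_text_lines is hypothetical and not part of kraken.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def _local(tag: str) -> str:
    """Return an element's tag name without its XML namespace prefix."""
    return tag.rsplit('}', 1)[-1]


def count_empty_text_lines(alto_xml: str) -> Tuple[int, int]:
    """Return (empty, total) TextLine counts for one ALTO document.

    A line counts as empty when the concatenated CONTENT attributes of
    its String descendants contain nothing but whitespace.
    Hypothetical helper for illustration; not a kraken API.
    """
    root = ET.fromstring(alto_xml)
    empty = total = 0
    for elem in root.iter():
        if _local(elem.tag) != 'TextLine':
            continue
        total += 1
        text = ''.join(
            child.get('CONTENT', '')
            for child in elem.iter()
            if _local(child.tag) == 'String'
        )
        if not text.strip():
            empty += 1
    return empty, total
```

Calling it on the text of each ALTO file (for example, pathlib.Path(p).read_text(encoding='utf-8')) shows which files contain lines without a transcription.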

My command was:

ketos train --augment -d cuda:0 -f binary --base-dir R --normalization NFD --min-epochs 30 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 -o models/bl/21_uATR all.arrow

I set the flag at https://github.com/mittagessen/kraken/blob/ecb47081d64eb42fdb66ce344f26576ed54ab480/kraken/lib/dataset.py#L570 to True, and training with a binary dataset now works as expected.

Opening this issue to understand this behavior better.

mittagessen commented 1 year ago

The reversed semantics of the line-skipping flag in the dataset are a known issue in 4.2.0 and have been fixed in master for a while. I should probably tag a new release.