Closed l0rn0r closed 5 months ago
Just like with kraken, the -t and -e options expect a manifest file, i.e. a text file containing one path per line to whatever file type you're training on. To train with files directly just append them to the command line:
cocr train -f binary testpage_ketos.arrow
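If you would rather keep using -t, the manifest can be generated with a few lines of Python. This is only a minimal sketch: the helper name write_manifest and the file name train_manifest.txt are made up for illustration and are not part of cocr or kraken.

```python
from pathlib import Path


def write_manifest(dataset_paths, manifest_path):
    """Write a plain-text manifest: one dataset path per line, UTF-8 encoded.

    `write_manifest` is a hypothetical helper, not a cocr/kraken function.
    """
    lines = [str(Path(p)) for p in dataset_paths]
    manifest = Path(manifest_path)
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


if __name__ == "__main__":
    # The manifest lists the compiled dataset instead of being the dataset.
    manifest = write_manifest(["testpage_ketos.arrow"], "train_manifest.txt")
    print(manifest.read_text(encoding="utf-8"), end="")
```

The resulting text file, rather than the .arrow file itself, is then what goes after -t, e.g. cocr train -f binary -t train_manifest.txt.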
The network most likely won't converge with datasets below a couple of thousand lines in size when training from scratch. I'm working on pretrained model weights that should make fine-tuning with limited data more feasible.
I'd also suggest using kraken 5 for dataset compilation. It compiles much faster and the line extraction quality is slightly better than with version 4.x.
Thanks a lot! I have only worked with TrOCR until now, so I have almost no kraken experience.
I just wanted to get the hang of the right data format and so on. Now I'll start training from scratch on the Catmus dataset, an arrow file of 22 GB.
To get decent training speed you also need to add a couple of data loading workers (--workers), ideally 2x the batch size, and you can get a fairly significant speedup moving the dataset file to shared memory. If possible also switch to (b)float16 precision (--precision [16|bf16]) which reduces memory consumption quite considerably. On Catmus I got decent results with a batch size of 32 (-B 32, requires a large GPU) and adjusting the training schedule out to 150 epochs (--epochs 150) for eking out the last quarter percentage point in CER reduction.
On our A40s the combined speedups get me to ~3 batches/s, so ~100 lines/s.
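The settings above can be collected in a small helper script. Treat this as a rough sketch under assumptions: /dev/shm as the shared-memory location is a Linux convention, the function names are invented for illustration, and where exactly each flag belongs on the cocr command line should be checked against cocr train --help.

```python
import shutil
from pathlib import Path


def stage_in_shared_memory(dataset, shm_dir="/dev/shm"):
    """Copy the dataset into a RAM-backed directory and return the new path.

    Assumes a Linux-style tmpfs at /dev/shm with enough free RAM for the
    whole file. Falls back to the original path if the directory is missing.
    """
    src = Path(dataset)
    target_dir = Path(shm_dir)
    if not target_dir.is_dir():
        return src
    dst = target_dir / src.name
    if not dst.exists():  # an existing copy is reused, not refreshed
        shutil.copy2(src, dst)
    return dst


def training_flags(batch_size=32, epochs=150, precision="bf16"):
    """Build the flag list from the advice above: workers = 2x batch size."""
    workers = 2 * batch_size
    return [
        "-B", str(batch_size),
        "--workers", str(workers),
        "--precision", precision,
        "--epochs", str(epochs),
    ]


def lines_per_second(batches_per_second, batch_size):
    """Throughput in lines/s, given one line per batch element."""
    return batches_per_second * batch_size


if __name__ == "__main__":
    print(" ".join(training_flags()))
    print(lines_per_second(3, 32))  # 3 batches/s at batch size 32
```

With the defaults this gives 64 workers for a batch size of 32, and 3 batches/s works out to 96 lines/s, which is where the rounded ~100 lines/s figure comes from.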
Hello, I'm trying to find out how to preprocess my data to train my first cocr model. I followed these steps:
1. pip install kraken (for ketos) and pip install . in the cocr directory
2. ketos compile -f page -o testpage_ketos.arrow mypage.xml
3. cocr train -f binary -t testpage_ketos.arrow
I immediately got a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 8: invalid start byte
The output told me it was caused by the method _validate_manifests in ./cli/util.py. There, on line 35, is the statement for entry in manifest.readlines(): and manifest seems to be an <unopened file 'testpage_ketos.arrow' r> with type <class 'click.utils.LazyFile'>.
How can I fix this? Is it a problem with my *.arrow file?
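The error itself can be reproduced without cocr. The sketch below assumes the dataset uses the Arrow IPC file layout, which starts with the magic bytes ARROW1, two padding bytes, and then a 0xFFFFFFFF continuation marker; decoding that header as UTF-8 text fails at byte 8. The helpers looks_like_arrow_file and read_manifest are hypothetical and are not cocr's _validate_manifests.

```python
ARROW_MAGIC = b"ARROW1"

# First 12 bytes of an Arrow IPC file under the layout assumed above:
# magic, two padding bytes, then the 0xFFFFFFFF continuation marker.
FAKE_HEADER = ARROW_MAGIC + b"\x00\x00" + b"\xff\xff\xff\xff"


def decode_error_position(data):
    """Return the byte offset at which UTF-8 decoding fails, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def looks_like_arrow_file(path):
    """Check whether the file starts with the Arrow magic bytes."""
    with open(path, "rb") as fp:
        return fp.read(len(ARROW_MAGIC)) == ARROW_MAGIC


def read_manifest(path):
    """Read a text manifest, refusing binary Arrow files with a clear error."""
    if looks_like_arrow_file(path):
        raise ValueError(
            f"{path} looks like a binary Arrow dataset, not a text manifest"
        )
    with open(path, "r", encoding="utf-8") as fp:
        # One path per line; blank lines are skipped.
        return [line.strip() for line in fp if line.strip()]


if __name__ == "__main__":
    print(decode_error_position(FAKE_HEADER))
```

Under that assumption the failing offset matches the position 8 in the traceback, which suggests the .arrow file is fine and is simply being read as if it were a text manifest.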
My setting:
Thanks for any help!