mittagessen / conformer_ocr

text recognizer with a conformer
Apache License 2.0

UnicodeDecodeError (invalid start byte) with `cocr train -f binary` #1

Closed: l0rn0r closed this issue 3 months ago

l0rn0r commented 3 months ago

Hello, I'm trying to figure out how to preprocess my data to train my first cocr model. I was following these steps:

I immediately got a `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 8: invalid start byte`.

The traceback pointed to the method `_validate_manifests` in `./cli/util.py`. There, on line 35, we have `for entry in manifest.readlines():`. `manifest` seems to be an `<unopened file 'testpage_ketos.arrow' r>` of type `<class 'click.utils.LazyFile'>`.

How can I fix this? Is it a problem with my `*.arrow` file?

My setup:

Thanks for any help!

mittagessen commented 3 months ago

Just like with kraken, the `-t` and `-e` options expect a manifest file, i.e. a text file containing one path per line to whatever file type you're training on. To train on files directly, just append them to the command line:

```
cocr train -f binary testpage_ketos.arrow
```
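For reference, a manifest is just a plain text file with one dataset path per line, passed via `-t` (training) or `-e` (evaluation). A minimal sketch with hypothetical file names:

```
# train_manifest.txt — one path per line (hypothetical paths)
data/pages_batch1.arrow
data/pages_batch2.arrow
```

```
cocr train -f binary -t train_manifest.txt
```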

The network most likely won't converge with datasets below a couple of thousand lines in size when training from scratch. I'm working on pretrained model weights that should make fine-tuning with limited data more feasible.

I'd also suggest using kraken 5 for dataset compilation. It compiles much faster and the line extraction quality is slightly better than with version 4.x.
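For the compilation step itself, kraken's `ketos compile` is what builds the binary `.arrow` datasets; a hedged sketch follows (the exact flag names are from memory and not confirmed in this thread, so verify against `ketos compile --help` on your kraken 5 install):

```
# compile ALTO/PAGE ground truth into a binary .arrow dataset file
# flag names are my recollection, not confirmed in this thread
ketos compile -f xml -o dataset.arrow path/to/gt/*.xml
```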

l0rn0r commented 3 months ago

Thanks a lot! I've only worked with TrOCR until now, so I don't really have any kraken experience.

I just wanted to get a feel for handling the right data formats and so on. Now I'll start a from-scratch training run on the Catmus dataset, a 22 GB arrow file.

mittagessen commented 3 months ago

To get decent training speed you also need to add a couple of data loading workers (`--workers`), ideally 2x the batch size, and you can get a fairly significant speedup by moving the dataset file to shared memory. If possible, also switch to (b)float16 precision (`--precision [16|bf16]`), which reduces memory consumption quite considerably. On Catmus I got decent results with a batch size of 32 (`-B 32`, requires a large GPU) and stretching the training schedule out to 150 epochs (`--epochs 150`) to eke out the last quarter percentage point of CER reduction.
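Putting those flags together, a sketch of a full invocation (the dataset file name and the `/dev/shm` location are assumptions about a Linux setup; the flags themselves are the ones mentioned above):

```
# copy the dataset into shared memory (tmpfs) to speed up data loading
cp catmus.arrow /dev/shm/

# batch size 32, 2x as many loader workers, bfloat16 precision, 150-epoch schedule
cocr train -f binary -B 32 --workers 64 --precision bf16 --epochs 150 /dev/shm/catmus.arrow
```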

On our A40s the combined speedups get me to ~3 batches/s, i.e. ~100 lines/s.