mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Transcription model producing empty output #523

Closed mabarber92 closed 5 months ago

mabarber92 commented 1 year ago

As noted in today's meeting, the model trained from scratch on Arabic manuscripts fails to recognise any text. I have tried running it on both binarized and un-binarized images. The output in eScriptorium looks like this:

[screenshot: eScriptorium transcription panel showing empty output]

The model being used is this: ms_scratch.zip

I believe the model may have an error, as it doesn't have an accuracy rate. However, eScriptorium does not error (and I'm assuming that Kraken isn't erroring in the backend either, as it produces output).

mabarber92 commented 1 year ago

When running the command for a set of manuscripts

ketos train --device cuda:0 --output p11-scratch --normalization NFD --normalize-whitespace --format-type alto sbzb_glaser_33.pdf_page_11.xml

The log is as follows:

p11-scratch.log

David pulled the latest version of Kraken from the main branch but does not see the training loss. Can you suggest the correct branch to use?

When running equivalent training for print

ketos train --device cuda:0 --load ~j.murel/ArabicTestOutput/print_transcription_NEW.mlmodel --output p11 --normalization NFD --normalize-whitespace --resize add --format-type alto sbzb_glaser_33.pdf_page_11.xml

This is the log:

p11.log

mabarber92 commented 1 year ago

@mittagessen @dasmiq Would it make sense to figure this out within this issue?

dasmiq commented 1 year ago

This was the page I was trying to train on: https://github.com/OpenITI/arabic_ms_data/blob/main/firuzabadi_al_qamus_al_muhit/sbzb_glaser_33/sbzb_glaser_33.pdf_page_11.xml (I've seen the same issue of zero accuracy with larger training runs, but hopefully this is sufficient to test.)

dasmiq commented 1 year ago

@mabarber92 To correct what you said, my second training run was on the same manuscript page, but I initialized the model from a print model instead of starting from scratch.

colibrisson commented 1 year ago

> David pulled the latest version of Kraken from the main branch but does not see the training loss. Can you suggest the correct branch to use?

You can monitor the training using Tensorboard. You just need to add the --logger tensorboard and --log-dir ./ arguments to your command.
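For reference, the earlier from-scratch command with TensorBoard logging enabled would look something like the following (a sketch only: it reuses the paths from the command above and adds the two flags colibrisson mentions):

```shell
# Same from-scratch training command as above, with TensorBoard logging added.
# --logger tensorboard writes event files; --log-dir sets where they are stored.
ketos train --device cuda:0 --output p11-scratch \
    --normalization NFD --normalize-whitespace \
    --logger tensorboard --log-dir ./ \
    --format-type alto sbzb_glaser_33.pdf_page_11.xml

# Then inspect the loss curves with:
tensorboard --logdir ./
```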

colibrisson commented 1 year ago

> @mabarber92 To correct what you said, my second training run was on the same manuscript page, but I initialized the model from a print model instead of starting from scratch.

Have you tried using a smaller learning rate, e.g. 0.0001?

dasmiq commented 1 year ago

Yes, I've used smaller learning rates in many experiments. But I sent the results using the default parameters for simplicity.

dstoekl commented 1 year ago

I usually set -B 1 -r 0.0001 -w 0. For a greater batch size, increase -r by the square root of -B.

A single page is not enough ground truth to converge, however.
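dstoekl's rule of thumb (scale the learning rate with the square root of the batch size) can be sketched as follows; the function name and defaults are illustrative, not part of the kraken/ketos API:

```python
import math

def scaled_lr(base_lr=0.0001, base_batch=1, batch=1):
    """Square-root LR scaling: grow the learning rate with
    sqrt(batch / base_batch), starting from -r 0.0001 at -B 1."""
    return base_lr * math.sqrt(batch / base_batch)

# e.g. moving from -B 1 to -B 16 suggests -r 0.0001 * sqrt(16) = 0.0004
print(scaled_lr(batch=16))
```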

dasmiq commented 1 year ago

Here is the log from training at --lrate 0.0001: p11-scratch-le4.log

dasmiq commented 1 year ago

Thank you @dstoekl I know a single page is small, but this issue with training from scratch persists at larger amounts of training data. Could you suggest an amount of data and training parameters you would like us to run?

dstoekl commented 1 year ago

50 pages? And try increasing --lag to 20?
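For readers unfamiliar with the flag: --lag is ketos' early-stopping patience, i.e. training stops after that many validation checks without improvement. A minimal sketch of the behaviour (the function below is illustrative, not kraken's actual implementation):

```python
def should_stop(val_accuracies, lag=20):
    """Return True once `lag` consecutive validation checks
    have passed without improving on the best accuracy so far."""
    best = float("-inf")
    checks_since_best = 0
    for acc in val_accuracies:
        if acc > best:
            best = acc
            checks_since_best = 0  # improvement resets the counter
        else:
            checks_since_best += 1
        if checks_since_best >= lag:
            return True  # patience exhausted: stop training
    return False
```

A larger --lag therefore gives a slow-starting from-scratch model more epochs to begin converging before training is cut off.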

colibrisson commented 1 year ago

You should first use Tensorboard to check if your training loss is decreasing.

mittagessen commented 1 year ago

@dasmiq The training loss probably gets overwritten by the pytorch-lightning progress bar when the validation loss is available. Tensorboard logging is probably the quickest way to check it though.