Training advice 2 : recognition

colibrisson commented 1 year ago

It's great to have a beautiful kitchen with lots of modern tools, but if you don't have the chef's recipe it's pointless. I found #445 very interesting and would really appreciate if could share your recipe regarding recognition training.

mittagessen commented 1 year ago

The defaults are fine now. The new default architecture, learning rate, etc. works well for decently sized datasets. In case were the network doesn't converge before the early stopping aborts training (the new architecture needs a bit more time to produce coherent output) I'd suggest setting --min-epochs (once it's fixed in lightning) or --lag to something like 15 or 20, although the latter will cause training for naught of a couple epochs at the end.

Augmentation helps (set --augment)
Scaling the learning rate by the square root of the batch size is reasonable (1e-3 * sqrt(batch_size).
Fine-tuning is immensely powerful. Use both mode not add as the net will rapidly unlearn missing labels in the new dataset.
If the new dataset is fairly dissimilar or your base model is one that has been pretrained with ketos pretrain warmup (--warmup) and warmup and backbone freezing (--freeze-backbone) for 1-2 epochs will help. I like to set the warmup end point after backbone freezing but don't really have numbers to back that up.
LR schedules don't seem to do much. I know a lot of people use them but as optimal number of epochs to train highly depends on the dataset size and complexity dialing that in correctly can be difficult. LR schedules + early stopping is not the best way to do things. It's an either/or proposition.
Use precompiled binary datasets and put them in a place where they can be memory mapped during training (local storage, not NFS or similar). Scale batch size until GPU utilization reaches 100%.
Upload your models to the model repository ;).

Those are the things that come to mind right now.

colibrisson commented 1 year ago

Use precompiled binary datasets and put them in a place where they can be memory mapped during training (local storage, not NFS or similar). Scale batch size until GPU utilization reaches 100%.

On this subject, I was wondering why you don't use PL built-in batch-size finder?

mittagessen commented 1 year ago

Because on most GPUs batch size is not limited by GPU memory but either sample loading time or GPU cycles. On many you'd get batch sizes that are multiples of the training set. RNNs are relatively lightweight memory-wise but need a lot of computation.

colibrisson commented 1 year ago

But wouldn't it make sense as an option, especially when using binary datasets?

mittagessen commented 1 year ago

We could add it but to be frank I'd like to cut down on the number of switches and toggles on the training side. I played around with gdtuo to mostly eliminate learning rate/momentum/weight decay setting and potentially get rid of all the LR schedulers. All these hyperparameter switches also break quite often because testing all the combinations is a (computationally expensive) nightmare.

But as I said unless you're using a truly massive architecture you'll just end up with a GPU that is fully busy and has slightly fuller memory.

colibrisson commented 1 year ago

You are right.

rohanchn commented 1 year ago

Could you describe a "decently sized datasets" for recognition and segmentation training for historical print?

colibrisson commented 1 year ago

@rohanchn you can take look at the paper How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR by Strobel et al..

rohanchn commented 1 year ago

Thank you for sharing this @colibrisson. Indeed a very interesting paper that I read some time ago. I would still like to learn what @mittagessen thinks is decently sized in the current architecture, paticularly for lithography in RTL script (Urdu). For Bengali, I have 97% in Kraken with 35K lines. For Hindi. it's similar to Bengali but I don't have baseline annotations drawn from several texts yet. At least the Bengali model can generalize well. For Urdu, I had 95% with 34K lines. Improving CAR for Urdu looks tricky from here as I have not seen meanigful improvement even after adding 3-4K lines to the training set. This set is diverse, drawn from a variety of texts. In kraken's legacy architecture - which I guess this paper uses - I could see results in the high nineties with far fewer lines for both Bengali and Hindi.

mittagessen commented 1 year ago

The legacy architecture (you can just try it by changing to the old VGSL spec with the -s switch) usually performs quite a bit worse than the new one. We haven't seen the new one to produce worse CERs than the old, no matter the size of the dataset, but for small ones you might need to adjust the early stopping threshold (it shouldn't require this but min_epochs is broken in pytorch-lightning right now) because you need too many epochs initially to get any coherent output. There are no general rules here; the best way to determine is to just start a training run and see if the early stopping triggers too early. Everything else being equal that is.

Apart from getting the hyperparameters right, there's also the question of the amount of training data you need to get best results. This will vary widely depending on what you want to do and how much lines you can harvest. In your case 35k lines are probably more than enough. For simple lithography that is fairly uniform in style I'd say anything more than 5k lines isn't going to produce better results. In our experience it is more beneficial to assemble a wider range of training data to improve generalization which can even improve typeface/hand-specific recognition. You can do a simple dose escalation study to determine when additional data (of the same style) doesn't have anymore impact.

rohanchn commented 1 year ago

By legacy architecture, I meant training data in box seg format. Sorry about the confusion. I could get better results for one book with 900-1200 lines with that format. The model was not great at generalizing to new texts though. This has changed with more recent versions, but I need significantly more training data before a model can meaningfully process an unseen text. I don't think early Urdu print is straightforward, perhaps that's why I don't see a lot of improvement after what I have currently.

I am training a model right now in 4.3.3 with --min_epochs 30 and it's working fine. After reading this thread, I experimented with default and other --batch-size sizes and --lrate 0.001 with sqrt(batch_size) adjustment, but couldn't get meanigful output. --no-augment, --batch-size 1 and --lrate 0.0001 seems to work the best for urdu for me. When I scaled -B to 16 and -r 0.0004 with --augment, I got a drop of 180bps, but the model trained in ~half the time. Does this make sense?

I'll look into the escalation study to understand this better. Thank you.

mittagessen commented 1 year ago

I am training a model right now in 4.3.3 with --min_epochs 30 and it's working fine.

The switch is broken in pytorch-lightning for now. If the early stopping triggers but the minimum number of epochs isn't reached yet the trainer just runs the whole validation set after each training batch. You presumably don't want that.

Regarding augmentation, you might not get good results if your inputs are black and white. The augmentor is principally designed for grayscale inputs. Although it has been on the ToDo list to factor it out and make it a bit more flexible. We had an intern working on it a while ago....

180bps?

With dose escalation I just meant first training a model with a small dataset, then training a new one with slightly more data, and so on until you know where the limits of adding more data are. The term comes from clinical trials not machine learning so you won't find anything about it in this context. It was just a bad analogy.

rohanchn commented 1 year ago

You presumably don't want that.

Right, got it. Thanks!

Almost all my Urdu images are color. I also use OpenITI's data, which is also not black and white.

180bps?

Sorry, 1.8%.

When I scaled -B to 16 and -r 0.0004 with --augment, I got a drop of ~180bps~ 1.8%, but the model trained in ~half the time.

It was just a bad analogy.

Sure, I think I got what you meant. It's like what Strobel et al.. do in the paper @colibrisson mentioned.

PonteIneptique commented 1 year ago

May I recommend turning this into a page on the doc ?

mittagessen commented 1 year ago

@PonteIneptique @colibrisson @rohanchn If any of you feel like integrating all of this into the training notes I'd gladly merge a pull request. I probably won't find the time in the next few days.

colibrisson commented 1 year ago

No problem. I will do it this week.

colibrisson commented 1 year ago

@mittagessen where do you want to add the training best practices? At the top or at the end of docs/ketos.rst?

mittagessen / kraken

Training advice 2 : recognition #446