Closed · colibrisson closed this 1 year ago
The defaults are fine now. The new default architecture, learning rate, etc. work well for decently sized datasets. In cases where the network doesn't converge before early stopping aborts training (the new architecture needs a bit more time to produce coherent output), I'd suggest setting `--min-epochs` (once it's fixed in lightning) or `--lag` to something like 15 or 20, although the latter will cause a couple of wasted training epochs at the end.
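As a concrete illustration of the `--lag` suggestion, here is a sketch of a training invocation (the dataset path is a placeholder, and the command is only echoed so it can be inspected without kraken installed):

```shell
# Hypothetical invocation: raise the early-stopping patience to 20 epochs.
# "dataset.arrow" is a placeholder for a compiled binary dataset.
cmd="ketos train -f binary --lag 20 dataset.arrow"
echo "$cmd"
```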
Use `--augment`.

When increasing the batch size, scale the learning rate accordingly (1e-3 * sqrt(batch_size)).
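The sqrt scaling rule can be computed directly; this small helper (using awk for the floating-point math) shows the learning rates it yields for a few batch sizes:

```shell
# lr = 1e-3 * sqrt(batch_size); awk handles the floating-point arithmetic.
scaled_lr() { awk -v b="$1" 'BEGIN { printf "%.4f", 0.001 * sqrt(b) }'; }

echo "$(scaled_lr 1)"    # batch size 1  -> 0.0010
echo "$(scaled_lr 16)"   # batch size 16 -> 0.0040
```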
When fine-tuning on a new dataset, use `--resize` in `both` mode, not `add`, as the net will rapidly unlearn missing labels in the new dataset.
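The `both`/`add` modes above belong to kraken's `--resize` option for fine-tuning an existing model. A hedged sketch of such a call (model and dataset paths are placeholders; the command is echoed rather than executed):

```shell
# Hypothetical fine-tuning sketch: load an existing model and resize its
# output layer in `both` mode. Paths are placeholders.
cmd="ketos train -i model.mlmodel --resize both -f binary new_data.arrow"
echo "$cmd"
```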
For models from `ketos pretrain`, warmup (`--warmup`) and backbone freezing (`--freeze-backbone`) for 1-2 epochs will help. I like to set the warmup end point after backbone freezing but don't really have numbers to back that up.

Those are the things that come to mind right now.
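A sketch of what fine-tuning a `ketos pretrain` checkpoint with warmup and backbone freezing might look like. The flag values here (and whether each flag is counted in steps or epochs) are assumptions; verify against `ketos train --help`:

```shell
# Hypothetical sketch: checkpoint and dataset names are placeholders, and the
# --warmup/--freeze-backbone values are illustrative assumptions.
cmd="ketos train -i pretrained_best.mlmodel --warmup 2 --freeze-backbone 2 -f binary dataset.arrow"
echo "$cmd"
```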
Use precompiled binary datasets and put them in a place where they can be memory mapped during training (local storage, not NFS or similar). Scale batch size until GPU utilization reaches 100%.
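Binary datasets are produced with `ketos compile`; a hedged sketch of the packing step (flag names and paths are assumptions to check against `ketos compile --help`, and the output file should live on local storage so it can be memory mapped):

```shell
# Hypothetical compile step: pack XML ground truth into one memory-mappable
# arrow file. File names are placeholders; the command is only echoed.
cmd="ketos compile -f xml -o dataset.arrow gt/page0001.xml gt/page0002.xml"
echo "$cmd"
```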
On this subject, I was wondering why you don't use PL's built-in batch-size finder?
Because on most GPUs batch size is not limited by GPU memory but either sample loading time or GPU cycles. On many you'd get batch sizes that are multiples of the training set. RNNs are relatively lightweight memory-wise but need a lot of computation.
But wouldn't it make sense as an option, especially when using binary datasets?
We could add it but to be frank I'd like to cut down on the number of switches and toggles on the training side. I played around with gdtuo to mostly eliminate learning rate/momentum/weight decay setting and potentially get rid of all the LR schedulers. All these hyperparameter switches also break quite often because testing all the combinations is a (computationally expensive) nightmare.
But as I said unless you're using a truly massive architecture you'll just end up with a GPU that is fully busy and has slightly fuller memory.
You are right.
Could you describe a "decently sized dataset" for recognition and segmentation training for historical print?
@rohanchn you can take a look at the paper How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR by Strobel et al.
Thank you for sharing this @colibrisson. Indeed a very interesting paper that I read some time ago. I would still like to learn what @mittagessen thinks is decently sized in the current architecture, particularly for lithography in RTL script (Urdu). For Bengali, I have 97% in kraken with 35K lines. For Hindi, it's similar to Bengali, but I don't have baseline annotations drawn from several texts yet. At least the Bengali model can generalize well. For Urdu, I had 95% with 34K lines. Improving CAR for Urdu looks tricky from here, as I have not seen meaningful improvement even after adding 3-4K lines to the training set. This set is diverse, drawn from a variety of texts. In kraken's legacy architecture - which I guess this paper uses - I could see results in the high nineties with far fewer lines for both Bengali and Hindi.
The legacy architecture (you can just try it by changing to the old VGSL spec with the `-s` switch) usually performs quite a bit worse than the new one. We haven't seen the new one produce worse CERs than the old, no matter the size of the dataset, but for small ones you might need to adjust the early stopping threshold (it shouldn't require this, but `min_epochs` is broken in pytorch-lightning right now) because you initially need too many epochs to get any coherent output. There are no general rules here; the best way to find out is to just start a training run and see if early stopping triggers too early. Everything else being equal, that is.
Apart from getting the hyperparameters right, there's also the question of how much training data you need to get the best results. This will vary widely depending on what you want to do and how many lines you can harvest. In your case 35k lines are probably more than enough. For simple lithography that is fairly uniform in style, I'd say anything more than 5k lines isn't going to produce better results. In our experience it is more beneficial to assemble a wider range of training data to improve generalization, which can even improve typeface/hand-specific recognition. You can do a simple dose escalation study to determine when additional data (of the same style) stops having an impact.
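The dose escalation idea above can be sketched as a loop over growing data slices, training one model per slice and comparing validation accuracy afterwards. The slice file names are placeholders, and the commands are only echoed:

```shell
# Sketch of a dose-escalation run: one model per growing subset of the data.
# "slice_N.arrow" files are hypothetical pre-cut subsets of the full dataset.
for n in 1000 2000 4000 8000; do
  echo "ketos train -f binary -o model_${n} slice_${n}.arrow"
done
```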
By legacy architecture, I meant training data in box seg format. Sorry about the confusion. I could get better results for one book with 900-1200 lines with that format. The model was not great at generalizing to new texts though. This has changed with more recent versions, but I need significantly more training data before a model can meaningfully process an unseen text. I don't think early Urdu print is straightforward, perhaps that's why I don't see a lot of improvement after what I have currently.
I am training a model right now in 4.3.3 with `--min_epochs 30` and it's working fine.
After reading this thread, I experimented with the default and other `--batch-size` values and `--lrate 0.001` with the `sqrt(batch_size)` adjustment, but couldn't get meaningful output. `--no-augment`, `--batch-size 1`, and `--lrate 0.0001` seem to work best for Urdu for me. When I scaled `-B` to 16 and `-r 0.0004` with `--augment`, I got a drop of 180bps, but the model trained in ~half the time. Does this make sense?
I'll look into the escalation study to understand this better. Thank you.
> I am training a model right now in 4.3.3 with `--min_epochs 30` and it's working fine.
The switch is broken in pytorch-lightning for now. If the early stopping triggers but the minimum number of epochs isn't reached yet the trainer just runs the whole validation set after each training batch. You presumably don't want that.
Regarding augmentation, you might not get good results if your inputs are black and white; the augmentor is principally designed for grayscale inputs. It has been on the ToDo list to factor it out and make it a bit more flexible. We had an intern working on it a while ago...
180bps?
With dose escalation I just meant first training a model with a small dataset, then training a new one with slightly more data, and so on until you know where the limits of adding more data are. The term comes from clinical trials not machine learning so you won't find anything about it in this context. It was just a bad analogy.
> You presumably don't want that.
Right, got it. Thanks!
Almost all my Urdu images are color. I also use OpenITI's data, which is also not black and white.
> 180bps?
Sorry, 1.8%.
> When I scaled `-B` to 16 and `-r 0.0004` with `--augment`, I got a drop of ~~180bps~~ 1.8%, but the model trained in ~half the time.
> It was just a bad analogy.
Sure, I think I got what you meant. It's like what Strobel et al. do in the paper @colibrisson mentioned.
May I recommend turning this into a page in the docs?
@PonteIneptique @colibrisson @rohanchn If any of you feel like integrating all of this into the training notes I'd gladly merge a pull request. I probably won't find the time in the next few days.
No problem. I will do it this week.
@mittagessen where do you want to add the training best practices? At the top or at the end of docs/ketos.rst?
It's great to have a beautiful kitchen with lots of modern tools, but if you don't have the chef's recipe it's pointless. I found #445 very interesting and would really appreciate it if you could share your recipe regarding recognition training.