knaw-huc / loghi


HTR model overtraining #24

Closed fattynoparents closed 5 days ago

fattynoparents commented 2 months ago

I've been trying to fine-tune a model since Loghi version 1.3.14, and I've noticed that with the last two versions (2.0.4 and 2.0.2) I end up with overtraining even if I only run 100 epochs.

14/05/2024 06:28:32 - INFO - Epoch 65 - Average CER: 0.0640 - Validation CER: 0.0242
14/05/2024 06:28:32 - INFO - Validation CER improved from 0.0251 to 0.0242
606/606 [==============================] - 843s 1s/step - loss: 4.2215 - CER_metric: 0.0640 - WER_metric: 0.4693 - val_loss: 1.6437 - val_CER_metric: 0.0242 - val_WER_metric: 0.2291
Epoch 67/100
606/606 [==============================] - ETA: 0s - loss: 4.1452 - CER_metric: 0.0618 - WER_metric: 0.4656
14/05/2024 06:42:43 - INFO - Epoch 66 - Average CER: 0.0618 - Validation CER: 0.0228
14/05/2024 06:42:43 - INFO - Validation CER improved from 0.0242 to 0.0228

The validation CER improves with almost every epoch and seems too good to be true :) I also have a small amount of data, though I used to have even less. Is this something that can be improved with some parameters? Thanks in advance.

rvankoert commented 2 months ago

If it seems too good to be true... it usually is. Make sure you have no overlap between your validation data and your training data.

There are many options you can set to make the most of your training data:

--aug_elastic_transform applies elastic transformation (very good, but very computationally expensive)
--aug_random_crop makes crops in the height; useful when the interline distance varies within your data
--aug_random_width stretches the text lines; a good and computationally cheap transform
--aug_random_shear applies random shear; useful when you have multiple writers in your data
--aug_distort_jpeg applies JPEG distortion; useful if your data has different scan resolutions
--aug_blur applies blur; made specifically for the carbon-copied paper we have to deal with
--aug_invert inverts the images; useful if you have white ink on black paper

In general, the first four augmentations are useful. The last three are more specific, and I think you should only use them if you have data that specifically requires them or if you are building a general base model from large amounts of text lines.

You can find them here: https://github.com/knaw-huc/loghi-htr/blob/master/src/setup/arg_parser.py
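For illustration only, here is a rough sketch of how the four generally useful flags might be appended to whatever training command you already use. The TRAIN_CMD placeholder is an assumption, not Loghi's actual interface; only the --aug_* flags come from the list above.

```bash
# Sketch only: collect the four generally useful augmentation flags and pass
# them to your existing training invocation. TRAIN_CMD is a placeholder for
# whatever you currently run; adjust it to your own setup.
TRAIN_CMD="bash your-training-pipeline.sh"   # placeholder, not a real Loghi script name
AUG_FLAGS="--aug_elastic_transform --aug_random_crop --aug_random_width --aug_random_shear"

$TRAIN_CMD $AUG_FLAGS
```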

fattynoparents commented 2 months ago

Make sure you have no overlap between your validation data and your training data.

I think that might be the issue, thanks! I will also look at the other parameters if this doesn't help.

fattynoparents commented 1 month ago

Reopening this with new observations and a question.

My initial question arose because the create-train-data.sh script only produced one text file (training_all.txt), so by mistake I used the same paths for validation and training. I then manually split my data into two sets and created separate data for each process.

The last time I ran the script, however, it produced three text files (training_all.txt, training_all_train.txt and training_all_val.txt), so I used those instead of splitting manually. But I noticed that the model seems to be overtraining again; I got a result as good as about 2% CER. So the question is: could the data in the two files (training_all_train.txt and training_all_val.txt) overlap at some point (I tried to check this but didn't find any overlap at first glance)? Or could it happen by chance that the data was split in such a way that the training resulted in overfitting? Or what else could be the reason?

Thanks in advance!

UPDATE: I have now also run the training using manually split data, and the model still seems to be overfitting, giving about 2-3% CER.

Simon-Dirks commented 1 month ago

I might be running into similar issues, training on roughly 33k words with a 10% training/validation split (created using the create-training-data script).

[two screenshots attached]

rvankoert commented 1 month ago

The training data after splitting should not overlap. The script uses the split command for the splitting. One edge case where overlap could nevertheless happen is when you set the validation percentage to 0 (100% training); an old file could then remain and cause overlap between training and validation.
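As a quick sanity check (a sketch that assumes the default file names mentioned above), you can sort the two lists and print any line that appears in both:

```bash
# Sketch: list any lines that occur in both the training and the validation
# list. File names are the defaults mentioned above; adjust paths as needed.
sort training_all_train.txt > /tmp/train_sorted.txt
sort training_all_val.txt   > /tmp/val_sorted.txt

# comm -12 prints only the lines common to both sorted files;
# empty output means there is no overlap.
comm -12 /tmp/train_sorted.txt /tmp/val_sorted.txt
```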

Could it be that it simply reaches 2-3% CER? 2-3% is not uncommon if you have regular handwriting or printed text. Have you tried it on some unseen materials? What is your material like? We get about 2.5% CER on multiple writers spanning 100 years using ~500 scans of ground truth. CER on printed text typically goes below 2% after a few dozen scans and goes down to 0.1% with enough training data.

The only CER I generally look at is the validation CER. The training CER might go lower, but it's best to ignore that value. If the validation CER goes up again, then you are overfitting. It's quite hard to overfit if you use enough data and augmentations.

fattynoparents commented 1 month ago

Yes, I also only look at the validation CER. We have mixed text (printed and handwritten, in various hands). I used about 350 pages of GT. I did try the trained model on unseen materials, and it doesn't look like 2% CER to me, though I might be wrong.

rvankoert commented 1 month ago

The splitting is line-based; this means all lines from all pages are mixed and then a percentage is drawn as validation.

You could try splitting the pages into two random sets and then running the create-train-data script on each of them, setting the training percentage in the create-train-data script to 100% for the training pages and to 0% for the validation pages. This way you get a split based on pages.

A page-based split is more representative of unseen materials. 2% CER should be about one error for every two to three lines (depending, of course, on the amount of text per line).

fattynoparents commented 1 month ago

0% for the validation-pages

This gives me an error by the way: split: invalid number of lines: ‘0’

rvankoert commented 1 month ago

Of course it does; that error is on my side. I always just use training_all.txt instead of training_all_train and training_all_val. If you look in the create-train-data.sh file, just comment out the last three or four lines, or ignore the resulting training_all_train.txt and training_all_val.txt.

When making the split page-based, it's good practice to rename training_all.txt (all files) to training_all_train.txt for the training data and to training_all_val.txt for the validation data. I really should write a script that does page-based splitting.
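Until such a script exists, a minimal sketch of a page-based split could look like the following; the directory layout (one ground-truth file per page in pages/) and the 90/10 ratio are assumptions, not part of Loghi:

```bash
# Sketch: page-level split into training and validation sets (roughly 90/10).
# Assumes one ground-truth file per page in pages/; adjust paths and ratio.
PAGES_DIR="pages"
mkdir -p split/train split/val

# Shuffle the page list once, then take the first 10% as validation.
ls "$PAGES_DIR" | shuf > /tmp/pages_shuffled.txt
TOTAL=$(wc -l < /tmp/pages_shuffled.txt)
VAL_COUNT=$((TOTAL / 10))

head -n "$VAL_COUNT" /tmp/pages_shuffled.txt | while read -r page; do
    cp "$PAGES_DIR/$page" split/val/
done
tail -n +"$((VAL_COUNT + 1))" /tmp/pages_shuffled.txt | while read -r page; do
    cp "$PAGES_DIR/$page" split/train/
done

# Afterwards, run create-train-data.sh on split/train and split/val separately
# and rename each resulting training_all.txt as described above.
```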