knaw-huc / loghi

MIT License

Can't set early_stopping_patience, random_width and elastic_transform when finetuning a model #21

Closed: fattynoparents closed this issue 4 months ago

fattynoparents commented 4 months ago

I'm trying to fine-tune an existing model, but the CER doesn't seem to be improving much. I don't have much GPU power or data, so I checked this tutorial with some tips https://github.com/knaw-huc/loghi/blob/main/tips_and_tricks.md where it says:

When you have little data or don't care about training time and want the best results use:

--random_width: augments the data by stretching and squeezing the input textline horizontally

--elastic_transform: augments the data by a random elastic transform

I added these parameters to the na-pipeline-train.sh file and echoed the command when running the script; here it is:

Starting Loghi HTR
docker run --gpus all --rm -u 0:0 -m 32000m --shm-size 10240m -ti 
-v /home/user/training/results/model-from-2024-02-19/output:/home/user/training/results/model-from-2024-02-19/output 
-v /tmp/tmp.ZyzcT5WKqF:/tmp/tmp.ZyzcT5WKqF -v /home/user/training/2024.03.06/output:/home/user/training/2024.03.06/output 
-v /scratch/republicprint:/scratch/republicprint loghi/docker.htr:1.3.10 python3 /src/loghi-htr/src/main.py 
--do_train --train_list /home/user/training/2024.03.06/output/training_all_train.txt 
--do_validate --validation_list /home/user/training/2024.03.06/output/training_all_val.txt 
--learning_rate 0.0002 
--channels 4 
--batch_size 2 
--epochs 100 
--gpu 1 
--height 64 
--use_mask 
--seed 1 
--beam_width 10 
--model None,64,None,3 Cr3,3,24 Bn Mp2,2,2,2 Cr3,3,48 Bn Mp2,2,2,2 Cr3,3,96 Bn Cr3,3,96 Bn Mp2,2,2,2 Rc Bl256 Bl256 Bl256 Bl256 Bl256 O1s92 
--multiply 1 
--output /home/user/training/2024.03.06/output 
--model_name thirdmodel 
--output_charlist /tmp/tmp.ZyzcT5WKqF/output_charlist.charlist 
--output /tmp/tmp.ZyzcT5WKqF/output 
--existing_model /home/user/training/results/model-from-2024-02-19/output/best_val 
--early_stopping_patience 50 
--random_width 
--elastic_transform
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

However, in further output I see that these parameters are not set:

{'gpu': '1', 'output': '/tmp/tmp.ZyzcT5WKqF/output', 'batch_size': 2, 'results_file': 'output/results.txt', 
'config_file_output': None, 'optimizer': 'adam', 'seed': 1, 'channels': 4, 'max_queue_size': 256, 'use_mask': True, 
'charlist': None, 'output_charlist': '/tmp/tmp.ZyzcT5WKqF/output_charlist.charlist', 'do_train': True, 'learning_rate': 0.0002, 
'epochs': 100, 'height': 64, 'width': 65536, 'train_list': '/home/user/training/2024.03.06/output/training_all_train.txt', 
'decay_steps': -1, 'decay_rate': 0.99, 'steps_per_epoch': None, 'output_checkpoints': False, 'use_float32': False, 
'early_stopping_patience': 20, 'multiply': 1, 'do_validate': True, 'validation_list': '/home/user/training/2024.03.06/output/training_all_val.txt', 
'test_list': None, 'training_verbosity_mode': 'auto', 
'do_inference': False, 'inference_list': None, 'model': 'None,64,None,3 Cr3,3,24 Bn Mp2,2,2,2 Cr3,3,48 Bn Mp2,2,2,2 Cr3,3,96 Bn Cr3,3,96 Bn Mp2,2,2,2 Rc Bl256 Bl256 Bl256 Bl256 Bl256 O1s92', 
'existing_model': '/home/user/training/results/model-from-2024-02-19/output/best_val', 
'model_name': 'thirdmodel', 'replace_final_layer': False, 'replace_recurrent_layer': None, 'thaw': False, 'freeze_conv_layers': False, 
'freeze_recurrent_layers': False, 'freeze_dense_layers': False, 'augment': False, 'elastic_transform': False, 
'random_crop': False, 'random_width': False, 'distort_jpeg': False, 'do_random_shear': False, 'greedy': False, 
'beam_width': 10, 'corpus_file': None, 'wbs_smoothing': 0.1, 'do_binarize_otsu': False, 
'do_binarize_sauvola': False, 'ignore_lines_unknown_character': False, 'check_missing_files': False, 'normalization_file': None, 'deterministic': False, 
'no_auto': False, 'do_blur': False, 'do_invert': False}

I have also set batch_size to 2 and learning_rate to 0.0002, and those parameters were picked up fine, but random_width, elastic_transform and early_stopping_patience were not.

Could you please help me with this? Also, what are possible reasons for the model training not improving, apart from the small amount of data? Maybe there are some other tips one can follow? Thanks a lot in advance.

rvankoert commented 4 months ago

Just a guess: did you add the parameters to the actual command as well? It appears once as an echo only and then again as the actual command.

fattynoparents commented 4 months ago

Thanks for the quick reply. Yes, I had made this error before, so I was careful to add the parameters to the actual command this time.

rvankoert commented 4 months ago

The command looks good. I'll try to debug later today to see what happens. Meanwhile, I noticed GPU=1; unless you have more than one GPU, it should be GPU=0. I'll get back to you later today.

fattynoparents commented 4 months ago

Thank you! Yes, I was experimenting a bit with various parameters; I have now changed it back to 0. Strangely enough, it looks as if it doesn't matter whether I set 0 or 1; loghi detects the correct GPU automatically.

fattynoparents commented 4 months ago

Btw, there's one more thing I wondered about (I don't know if a new issue should be opened or if I can ask it here): when there are characters in the training data that are not present in charlist.txt, they are ignored in the training process.

So before training starts I get a message such as: ignoring line: Fvn. Gaglviðr och några därmed besläktade ord Av, because the ð character is not present in the charlist file.

If I try to modify the charlist file to add the missing character, I expectedly get this error: number of characters in model is different from charlist provided. please find correct charlist and use --charlist CORRECTCHARLIST. Is there a workaround for correctly modifying the charlist?
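As a side note, the characters that will trigger those "ignoring line" messages can be spotted ahead of time. The sketch below is not part of loghi; the file names and contents are toy stand-ins (an ASCII example for brevity), but the same pattern works against a real charlist.txt and training list:

```shell
# Hypothetical helper, not part of loghi: list characters that occur in the
# training text but are absent from the charlist. Toy files stand in for
# the real charlist.txt and training data.
printf 'abc' > charlist.txt
printf 'acabx\n' > lines.txt

# One character per line, unique, for both files:
grep -o . lines.txt | sort -u > seen.txt
grep -o . charlist.txt | sort -u > known.txt

# Characters present in the data but missing from the charlist:
comm -23 seen.txt known.txt
```

Here the only flagged character is x; on real data, each character printed would cause its lines to be ignored during training.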

rvankoert commented 4 months ago

Another guess: did you add the new parameters last? If so, add them before --output, like this: --output_charlist $tmpdir/output_charlist.charlist \ --random_width \ --elastic_transform \ --early_stopping_patience 50 \ --output $tmpdir/output $BASEMODEL

Look carefully for the "\" on each line.
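A runnable sketch of that ordering, using echo in place of the docker invocation and dummy values for the variables (in na-pipeline-train.sh, $tmpdir and $BASEMODEL are set earlier in the script). It shows that with a trailing backslash on every line, all flags survive as one command:

```shell
# Dummy stand-ins for variables that na-pipeline-train.sh defines earlier:
tmpdir=/tmp/example
BASEMODEL=/path/to/best_val

# New flags go before --output; each line ends in a backslash so the shell
# treats the whole thing as a single command. echo stands in for docker run.
args="$(echo --output_charlist $tmpdir/output_charlist.charlist \
    --random_width \
    --elastic_transform \
    --early_stopping_patience 50 \
    --output $tmpdir/output $BASEMODEL)"
printf '%s\n' "$args"
```

If a backslash is missing, the shell ends the command at that line and the remaining flags are silently dropped, which matches the symptom of parameters not showing up in the config dump.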

rvankoert commented 4 months ago

Btw, there's one more thing I wondered about (don't know if a new issue should be opened or I can ask it here) - when there are characters in the training data that are not present in the charlist.txt, they are ignored in the training process.

So I get f.ex. such a message before the training start: a ignoring line: Fvn. Gaglviðr och några därmed besläktade ord Av because the ð character is not present in the charlist file.

If I try to modify the charlist file to add the missing character, I expectedly get this error: _number of characters in model is different from charlist provided. please find correct charlist and use --charlist CORRECTCHARLIST Is there a workaround on how to correctly modify the charlist?

The suggested way is to use --replace_final and a low learning rate. For more details, look into the tips & tricks file, as it also explains how to freeze and thaw all the other layers. I'll try to update that file sometime in the next few weeks.

We might add the ability to include new characters without replacing the complete final layer.

As a workaround, you could try replacing one of the characters in the charlist. Make sure to use an editor that can handle all the unusual Unicode characters in that file, and that the file afterwards contains the same number of characters. Make sure to insert the new character in the same place the removed character had in the file. This is totally untested and might not work at all.
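That swap can also be done from the command line. The sketch below is equally untested against a real model; the charlist content is a toy stand-in, and the characters chosen (Cyrillic д replaced by ð) are arbitrary examples. The point is only that one known character is exchanged for the new one, leaving the count unchanged:

```shell
# Toy charlist standing in for the real file, which holds the model's full
# character set. Replace one character the model already knows (here 'д',
# an arbitrary choice) with the new 'ð', keeping the character count equal.
printf 'abcд' > charlist.txt
before=$(wc -m < charlist.txt)

sed 's/д/ð/' charlist.txt > charlist.new && mv charlist.new charlist.txt

after=$(wc -m < charlist.txt)
echo "before=$before after=$after"
```

Picking a replacement character of the same byte length (both д and ð are two bytes in UTF-8) keeps the count identical regardless of locale; afterwards the model's output unit that used to mean д will be trained to mean ð.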

fattynoparents commented 4 months ago

Another guess: did you add the new parameters last? If so, add them before --output, like this: --output_charlist $tmpdir/output_charlist.charlist \ --random_width \ --elastic_transform \ --early_stopping_patience 50 \ --output $tmpdir/output $BASEMODEL

Look carefully for the "\" on each line.

Thanks for the suggestion, it really seemed to help; I now get the following output:

{'gpu': '0', 'output': '/tmp/tmp.ui0yfVv7QB/output', 'batch_size': 2, 'results_file': 'output/results.txt', 
'config_file_output': None, 'optimizer': 'adam', 'seed': 1, 'channels': 4, 'max_queue_size': 256, 'use_mask': True, 
'charlist': None, 'output_charlist': '/tmp/tmp.ui0yfVv7QB/output_charlist.charlist', 'do_train': True, 
'learning_rate': 0.0003, 'epochs': 100, 'height': 64, 'width': 65536, 'train_list': '/home/user/training/2024.03.06/output/training_all_train.txt', 'decay_steps': -1, 'decay_rate': 0.99,
 'steps_per_epoch': None, 'output_checkpoints': False, 'use_float32': False, 'early_stopping_patience': 20, 
'multiply': 1, 'do_validate': True, 'validation_list': '/home/user/training/2024.03.06/output/training_all_val.txt', 
'test_list': None, 'training_verbosity_mode': 'auto', 'do_inference': False, 'inference_list': None, 
'model': 'None,64,None,3 Cr3,3,24 Bn Mp2,2,2,2 Cr3,3,48 Bn Mp2,2,2,2 Cr3,3,96 Bn Cr3,3,96 Bn Mp2,2,2,2 Rc Bl256 Bl256 Bl256 Bl256 Bl256 O1s92', 
'existing_model': '/home/user/training/results/2024-02-19-base-model/output/best_val', 
'model_name': 'model_trained_two_rounds', 'replace_final_layer': False, 
'replace_recurrent_layer': None, 'thaw': False, 'freeze_conv_layers': False, 'freeze_recurrent_layers': False, 
'freeze_dense_layers': False, 'augment': False, 'elastic_transform': True, 'random_crop': False, 
'random_width': True, 'distort_jpeg': False, 'do_random_shear': False, 'greedy': False, 'beam_width': 1, 
'corpus_file': None, 'wbs_smoothing': 0.1, 'do_binarize_otsu': False, 
'do_binarize_sauvola': False, 'ignore_lines_unknown_character': False, 
'check_missing_files': False, 'normalization_file': None, 'deterministic': False, 
'no_auto': False, 'do_blur': False, 'do_invert': False}

random_width and elastic_transform are set to True. However, I now get the following error when running the script:

ValueError: in user code:

    File "/src/loghi-htr/src/data_generator.py", line 133, in load_images  *
        image = tf.image.resize(image, [image_height, image_width])

    ValueError: 'images' contains no shape.
rvankoert commented 4 months ago

I think there is a bug there. Could you try adding --random_crop as a workaround?

fattynoparents commented 4 months ago

I think there is a bug there. Could you try adding --random_crop as a workaround?

Thanks, I added --random_crop and it helped; the training has now started running. I will compare the results with what I got when random_width and elastic_transform were set to false.

fattynoparents commented 4 months ago

Btw, there's one more thing I wondered about (I don't know if a new issue should be opened or if I can ask it here): when there are characters in the training data that are not present in charlist.txt, they are ignored in the training process. So before training starts I get a message such as: ignoring line: Fvn. Gaglviðr och några därmed besläktade ord Av, because the ð character is not present in the charlist file. If I try to modify the charlist file to add the missing character, I expectedly get this error: number of characters in model is different from charlist provided. please find correct charlist and use --charlist CORRECTCHARLIST. Is there a workaround for correctly modifying the charlist?

The suggested way is to use --replace_final and a low learning rate. For more details, look into the tips & tricks file, as it also explains how to freeze and thaw all the other layers. I'll try to update that file sometime in the next few weeks.

We might add the ability to include new characters without replacing the complete final layer.

As a workaround, you could try replacing one of the characters in the charlist. Make sure to use an editor that can handle all the unusual Unicode characters in that file, and that the file afterwards contains the same number of characters. Make sure to insert the new character in the same place the removed character had in the file. This is totally untested and might not work at all.

I will try to follow your advice, thanks. Meanwhile, I wanted to bring your attention to another detail regarding the charlist.txt file. After fine-tuning a model, one gets a folder with three files: model.keras, config.json and charlist.txt. However, when I then try to use this model for inference, I can't use the newly created charlist.txt file; I get this error:

model_outputs: 457
charlist: 456
number of characters in model is different from charlist provided.
please find correct charlist and use --charlist CORRECT_CHARLIST
if the charlist is just 1 lower: did you forget --use_mask

I do use --use_mask (it's set to true by default and I didn't change it).

fattynoparents commented 4 months ago

As a workaround, you could try replacing one of the characters in the charlist. Make sure to use an editor that can handle all the unusual Unicode characters in that file, and that the file afterwards contains the same number of characters. Make sure to insert the new character in the same place the removed character had in the file. This is totally untested and might not work at all.

I have now tested it, and it went quite well. I replaced some Cyrillic letters with ones from the Greek alphabet, and the resulting charlist.txt runs fine when transcribing.

fattynoparents commented 4 months ago

the suggested way is to use --replace_final

So I have used --replace_final when training a model on data containing characters that didn't exist in charlist.txt, and it seems to work fine. The resulting charlist.txt is suitable for transcribing a new batch of data.