githubharald / SimpleHTR

Handwritten Text Recognition (HTR) system implemented with TensorFlow.
https://towardsdatascience.com/2326a3487cd5
MIT License

Wrong detection of words in model validation #169

Closed nunomrm closed 11 months ago

nunomrm commented 11 months ago

I am training the SimpleHTR model on the IAM dataset with python main.py --mode train --data_dir ../data/iam_handwriting_database/ --batch_size 250 --early_stopping 10 --decoder wordbeamsearch. I don't know why the model recognizes only a single character instead of a word during validation, as shown below. Should I adjust the flags in the command above? I also ran the training without the --decoder option, and the model still predicts only a single character instead of the whole word.

(...)

Epoch: 1 Batch: 434/438 Loss: 16.11225700378418
Epoch: 1 Batch: 435/438 Loss: 14.37238597869873
Epoch: 1 Batch: 436/438 Loss: 14.731287002563477
Epoch: 1 Batch: 437/438 Loss: 15.04417610168457
Epoch: 1 Batch: 438/438 Loss: 15.918496131896973
Validate NN
Batch: 1 / 24
Ground truth -> Recognized
[ERR:2] "bit" -> "t"
[ERR:1] "," -> "t"
[ERR:2] "Di" -> "t"
[ERR:1] "," -> "t"
[ERR:1] """ -> "t"
[ERR:2] "he" -> "t"
[ERR:3] "told" -> "t"
[ERR:3] "her" -> "t"

(...)

Epoch: 2 Batch: 434/438 Loss: 14.63033676147461
Epoch: 2 Batch: 435/438 Loss: 15.177207946777344
Epoch: 2 Batch: 436/438 Loss: 14.057470321655273
Epoch: 2 Batch: 437/438 Loss: 14.50662612915039
Epoch: 2 Batch: 438/438 Loss: 14.31212043762207
Validate NN
Batch: 1 / 24
Ground truth -> Recognized
[ERR:3] "bit" -> "a"
[ERR:1] "," -> "a"
[ERR:2] "Di" -> "a"
[ERR:1] "," -> "a"
[ERR:1] """ -> "a"
[ERR:2] "he" -> "a"
[ERR:4] "told" -> "a"
[ERR:3] "her" -> "a"
[ERR:1] "." -> "a"
[ERR:1] """ -> "a"
[ERR:3] "But" -> "a"
[ERR:1] "I" -> "a"
githubharald commented 11 months ago

don't use --decoder wordbeamsearch for training. Further, don't expect any meaningful results in the first few epochs.
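For example, with the same flags as in your command above, simply drop the --decoder option (so the default decoder is used during validation):

python main.py --mode train --data_dir ../data/iam_handwriting_database/ --batch_size 250 --early_stopping 10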

nunomrm commented 11 months ago

I had already run it without that decoder option as well, and the results were the same even in the last epochs (still only a single character recognized per word). I will run it again without that option just to double-check, but I'm fairly sure it will not train well. I'll update here.

githubharald commented 11 months ago

yes, give it a try. Just checked a training log, it should look roughly like this: "charErrorRates": [ 0.9838042269187987, 0.8809788654060067, 0.5203559510567297, 0.33205784204671857, 0.29054505005561737, 0.2439599555061179, 0.2181979977753059, 0.20262513904338153, 0.18593993325917688, 0.18740823136818688, ...

So you should get some reasonable readouts after ~10 epochs of training.
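(Those values are essentially the summed edit distance between recognized and ground-truth texts, divided by the total number of ground-truth characters; roughly like the sketch below, which uses the editdistance package and is an illustration rather than the repo's exact code.)

import editdistance  # pip install editdistance

def char_error_rate(recognized, ground_truth):
    # summed edit distance over all samples, divided by the total number of ground-truth characters
    num_err = sum(editdistance.eval(r, g) for r, g in zip(recognized, ground_truth))
    num_total = sum(len(g) for g in ground_truth)
    return num_err / num_total

# every word collapsing to a single character gives a rate close to 1.0
print(char_error_rate(['t', 't', 't'], ['bit', 'told', 'her']))  # 0.8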

Check whether the data that is fed to the model is correct: set a breakpoint here and look at the texts and the images of the first few batch elements: https://github.com/githubharald/SimpleHTR/blob/master/src/dataloader_iam.py#L134
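If you prefer not to use a debugger, a few temporary lines just before the return of get_next() would also do. This is only a sketch, and the local variable names (imgs, gt_texts) are taken from the current dataloader_iam.py, so they may differ in your checkout:

import cv2  # already used by the repo for image loading

print(gt_texts[:5])                                            # should be real words, e.g. ['bit', ',', 'Di', ...]
print([None if im is None else im.shape for im in imgs[:5]])   # grayscale arrays, none of them should be None
cv2.imwrite('debug_sample.png', imgs[0])                       # dump the first image to disk and look at it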

nunomrm commented 11 months ago

After 10 epochs (not using --decoder wordbeamsearch in training) I get this:

(...)
Batch: 24 / 24
Ground truth -> Recognized
[ERR:6] "school" -> "a"
[ERR:1] "." -> "a"
[ERR:3] "Did" -> "a"
[ERR:3] "you" -> "a"
[ERR:6] "notice" -> "a"
[ERR:3] "that" -> "a"
[ERR:4] "girl" -> "a"
[ERR:3] "who" -> "a"
[ERR:3] "said" -> "a"
[ERR:5] "hullo" -> "a"
[ERR:2] "to" -> "a"
[ERR:3] "him" -> "a"
[ERR:2] "in" -> "a"
[ERR:3] "the" -> "a"
[ERR:5] "garden" -> "a"
[ERR:1] "?" -> "a"
Character error rate: 93.17018909899889%. Word accuracy: 1.9250780437044746%.
Character error rate not improved, best so far: 92.77864293659623%
No more improvement for 10 epochs. Training stopped.
nunomrm commented 11 months ago

*which is not good.

nunomrm commented 11 months ago

Moreover, when using the --fast option to load images with LMDB, I also get this error, where the images are not found:

Train NN
Traceback (most recent call last):
  File "main.py", line 209, in <module>
    main()
  File "main.py", line 194, in main
    train(model, loader, line_mode=args.line_mode, early_stopping=args.early_stopping)
  File "main.py", line 69, in train
    batch = loader.get_next()
  File "/home/nmonteir/personal/SimpleHTR/src/dataloader_iam.py", line 130, in get_next
    imgs = [self._get_img(i) for i in batch_range]
  File "/home/nmonteir/personal/SimpleHTR/src/dataloader_iam.py", line 130, in <listcomp>
    imgs = [self._get_img(i) for i in batch_range]
  File "/home/nmonteir/personal/SimpleHTR/src/dataloader_iam.py", line 120, in _get_img
    img = pickle.loads(data)
TypeError: a bytes-like object is required, not 'NoneType'
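For reference, that error seems to mean the LMDB lookup returned None, i.e. the image key was not found in the database. A quick way to check whether the LMDB database exists and has entries at all would be something like this (the 'lmdb' subfolder path is an assumption based on my --data_dir above):

import lmdb  # pip install lmdb

# assumed location: an 'lmdb' folder inside the data directory passed via --data_dir
env = lmdb.open('../data/iam_handwriting_database/lmdb', readonly=True)
with env.begin() as txn:
    print('entries in LMDB:', txn.stat()['entries'])  # 0 entries (or a failure to open) would explain the None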
nunomrm commented 11 months ago

I added a breakpoint() call inside the get_next() method you linked to. During the first epoch the code never stops there for debugging. I also tried leaving prints of the image variable (and the other one) in there, and nothing was printed.

How could I solve this overall?

githubharald commented 11 months ago

Get a proper IDE like PyCharm (it's free); then you can just set a breakpoint, with no need to put breakpoint() calls into the code. If print statements do not work, then something really weird is going on; maybe the code you changed is not being executed at all? Also, please make sure you work with the original code from the repo. I've had it a couple of times that people changed the code and then reported bugs.
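As a quick generic check (not specific to this repo), you can confirm that the file you edited is the one Python actually imports, e.g. from the src/ folder:

import dataloader_iam
print(dataloader_iam.__file__)  # should print the path of the src/dataloader_iam.py you edited

and git status in the repo root will show whether there are local modifications compared to the original code.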