githubharald / SimpleHTR

Handwritten Text Recognition (HTR) system implemented with TensorFlow.
https://towardsdatascience.com/2326a3487cd5
MIT License

TypeError: a bytes-like object is required, not 'NoneType' (dataloader_iam.py line 119) #150

Closed adam-funk closed 1 year ago

adam-funk commented 1 year ago

Hi, I'm trying to train SimpleHTR on our own dataset of handwritten two-digit numbers. I think I have set the training data files up correctly:

gt/
└── words.txt
img/
└── a00
    ├── a00-000u
    │   ├── a00-000-18-95.png
    │   ├── a00-000-18-96.png
    │   ├── a00-000-18-97.png
    │   ├── a00-000-18-98.png
    │   ├── a00-000-18-99.png
    ...

and the words.txt file looks like this:

a00-000-18-95.png ok 154 1544 301 49 32 CD 21
a00-000-18-96.png ok 154 1548 330 49 29 CD 16
a00-000-18-97.png ok 154 1548 367 60 24 CD 22
a00-000-18-98.png ok 154 1544 394 61 29 CD 22
a00-000-18-99.png ok 154 1543 421 63 29 CD 27
a00-000-19-00.png ok 154 1547 449 57 36 CD 21
a00-000-19-01.png ok 154 1548 478 56 34 CD 22
a00-000-19-02.png ok 154 1551 511 51 28 CD 40
a00-000-19-03.png ok 154 1546 541 59 38 CD 29
a00-000-19-04.png ok 154 1553 566 54 37 CD 18
...
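For reference, the lines above follow the whitespace-separated IAM words.txt layout; a minimal parse sketch (the column interpretation is my assumption based on the IAM format — only the first token, the ok/err flag, and the last token, the ground-truth text, matter here):

```python
def parse_words_line(line: str):
    """Split one words.txt line into (filename, ok_flag, transcription).

    Assumes the IAM-style layout shown above: first token is the image
    filename, second the segmentation flag, last the ground-truth text
    (here a two-digit number).
    """
    parts = line.strip().split(" ")
    return parts[0], parts[1], parts[-1]
```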

For an initial test on my laptop (before moving to an HPC server later) I have 255 images. I'm using the following command:

conda run -n simplehtr python main.py --mode train --fast --data_dir /opt/data/bluejackets/training_data/22014 --batch_size 50 --early_stopping 15

The console shows progress output from 0 255 up to 254 255, so I think it is finding the training data files, but it fails with this exception:

Traceback (most recent call last):
  File "/home/adam/sandboxes/SimpleHTR/src/main.py", line 200, in <module>
    main()
  File "/home/adam/sandboxes/SimpleHTR/src/main.py", line 185, in main
    train(model, loader, line_mode=args.line_mode, early_stopping=args.early_stopping)
  File "/home/adam/sandboxes/SimpleHTR/src/main.py", line 65, in train
    batch = loader.get_next()
  File "/home/adam/sandboxes/SimpleHTR/src/dataloader_iam.py", line 129, in get_next
    imgs = [self._get_img(i) for i in batch_range]
  File "/home/adam/sandboxes/SimpleHTR/src/dataloader_iam.py", line 129, in <listcomp>
    imgs = [self._get_img(i) for i in batch_range]
  File "/home/adam/sandboxes/SimpleHTR/src/dataloader_iam.py", line 119, in _get_img
    img = pickle.loads(data)
TypeError: a bytes-like object is required, not 'NoneType'
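(For reference, the exception itself is easy to reproduce: pickle.loads raises exactly this TypeError when handed None, which is what an LMDB get() returns for a missing key — a sketch of the failure mode, not SimpleHTR's actual code:)

```python
import pickle

def load_pickled(data):
    """Mimic the failing line: data is expected to be the pickled image
    bytes, but a missing LMDB key makes the lookup return None."""
    return pickle.loads(data)

try:
    load_pickled(None)  # missing key -> data is None
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'NoneType'
```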

I admit that the environment has >= versions of the packages in requirements.txt because I can't get conda to install the exact versions.

Am I doing something wrong? Any idea how to fix this?

Thanks Adam

adam-funk commented 1 year ago

I managed to get the correct versions by creating a fresh conda environment and running pip install -r requirements.txt in it, but I still get exactly the same dataloader error.

githubharald commented 1 year ago

Hi, you're using the --fast option, which expects a "pickled" dataset. Try running without this option; the images should then be loaded individually from disk.

adam-funk commented 1 year ago

Hi, I ran conda run -n simplehtr1 python create_lmdb.py --data_dir "${training_directory}" first (where $training_directory is the directory containing gt/words.txt and img). I thought that should allow --fast to work?

(Anyway, I'm getting a different error further along without --fast, but I'll see what I can do with it.)

Thanks Adam

githubharald commented 1 year ago

Check whether the LMDB was actually created: the IAM dataset folder should contain the subfolders gt, img and lmdb, and lmdb should contain a (quite large) data.mdb file.
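That check can be scripted with the standard library alone (a sketch; the gt/img/lmdb layout is as described above, and the concrete data_dir path is whatever you passed to create_lmdb.py):

```python
from pathlib import Path

def lmdb_ready(data_dir: str) -> bool:
    """Return True if create_lmdb.py appears to have produced a
    non-empty lmdb/data.mdb file inside the dataset folder."""
    mdb = Path(data_dir) / "lmdb" / "data.mdb"
    return mdb.is_file() and mdb.stat().st_size > 0
```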

Further, it might be caused by broken images: OpenCV loads a broken image as None instead of throwing an exception. In create_lmdb.py you can put an assert img is not None right after the cv2.imread call to check whether this is the case.
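That None-check pattern can be wrapped so the offending file is named immediately (a generic sketch of the suggested assert, not a patch to create_lmdb.py; the reader function is passed in so it works with cv2.imread or any stand-in):

```python
def checked_read(imread, path):
    """Wrap an image reader such as cv2.imread, which returns None for
    broken or unreadable files, so the failure surfaces at load time
    with the offending path instead of later inside the data loader."""
    img = imread(path)
    assert img is not None, f"broken or unreadable image: {path}"
    return img
```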