Can't run train_tensorflow.py on Google Colab GPU - only works on CPU

fosple commented 6 months ago

Bug description

Running "doctr/references/recognition/train_tensorflow.py" on Google Colab raises an error when GPU acceleration is enabled. With the CPU runtime, everything works fine.

Code snippet to reproduce the bug

Open Google Colab: https://colab.research.google.com

Add the following code to a Colab notebook:

!git clone https://github.com/mindee/doctr.git
!pip install -e doctr/.
!pip install tf2onnx

# Contains data/train and data/val folders, each with a file "labels.json" and folder "images"
!curl -LO https://www.myserver.com/100k_files.zip
!unzip -qq 100k_files.zip

!python /content/doctr/references/recognition/train_tensorflow.py crnn_vgg16_bn --min-chars 5 --max-chars 5 --train_path data/train --val_path data/val --epochs 100

Change settings (menu bar):

Runtime -> Change runtime type:

Python 3
Hardware accelerator: CPU

--> Code runs without an issue

Change settings (menu bar):

Runtime -> Change runtime type:

Python 3
Hardware accelerator: T4 GPU

--> Creates the error below (see traceback)

Error traceback

Traceback (most recent call last):
  File "/content/doctr/references/recognition/train_tensorflow.py", line 448, in <module>
    main(args)
  File "/content/doctr/references/recognition/train_tensorflow.py", line 346, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, args.amp)
  File "/content/doctr/references/recognition/train_tensorflow.py", line 91, in fit_one_epoch
    for images, targets in pbar:
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/content/doctr/doctr/datasets/loader.py", line 95, in __next__
    samples = list(multithread_exec(self.dataset.__getitem__, indices, threads=self.num_workers))
  File "/content/doctr/doctr/utils/multithreading.py", line 49, in multithread_exec
    results = map(lambda x: x, tp.map(func, seq))  # noqa: C417
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/content/doctr/doctr/datasets/datasets/base.py", line 56, in __getitem__
    img = self.img_transforms(img)
  File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 57, in __call__
    x = t(x)
  File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 111, in __call__
    img = tf.image.resize(img, self.wanted_size, self.method, self.preserve_aspect_ratio, self.antialias)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 5883, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Index out of range using input dim 1; input has only 1 dims [Op:StridedSlice] name: strided_slice/

Sometimes I also get:

Traceback (most recent call last):
  File "/content/doctr/references/recognition/train_tensorflow.py", line 448, in <module>
    main(args)
  File "/content/doctr/references/recognition/train_tensorflow.py", line 346, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, args.amp)
  File "/content/doctr/references/recognition/train_tensorflow.py", line 91, in fit_one_epoch
    for images, targets in pbar:
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/content/doctr/doctr/datasets/loader.py", line 95, in __next__
    samples = list(multithread_exec(self.dataset.__getitem__, indices, threads=self.num_workers))
  File "/content/doctr/doctr/utils/multithreading.py", line 49, in multithread_exec
    results = map(lambda x: x, tp.map(func, seq))  # noqa: C417
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/content/doctr/doctr/datasets/datasets/base.py", line 56, in __getitem__
    img = self.img_transforms(img)
  File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 57, in __call__
    x = t(x)
  File "/content/doctr/doctr/transforms/modules/base.py", line 216, in __call__
    return self.transform(img) if target is None else self.transform(img, target)  # type: ignore[call-arg]
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 401, in __call__
    _gaussian_filter(
  File "/content/doctr/doctr/transforms/functional/tensorflow.py", line 225, in _gaussian_filter
    [(width - 1) // 2, width - 1 - (width - 1) // 2],
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__FloorDiv_device_/job:localhost/replica:0/task:0/device:GPU:0}} Integer division by zero [Op:FloorDiv] name:

Environment

DocTR version: 0.9.0a0
TensorFlow version: 2.15.0
PyTorch version: 2.2.1+cu121 (torchvision 0.17.1+cu121)
OpenCV version: 4.8.0
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): Yes
CUDA runtime version: 12.2.140
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6

felixdittrich92 commented 6 months ago

Hi @fosple 👋 That's already a known issue, and we are on it :) CC @odulcy-mindee

As a workaround, you can disable multiprocessing; see https://mindee.github.io/doctr/using_doctr/running_on_aws.html

This should fix the issue
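For readers who want to see what "disabling multiprocessing" could look like in practice, here is a minimal sketch. It assumes the switch is an environment variable named DOCTR_MULTIPROCESSING_DISABLE, as described on the linked page; please verify the name there. The helper `map_maybe_threaded` and the set of accepted truthy values are hypothetical and only illustrate the pattern of falling back to a sequential loop. They are not docTR's actual implementation.

```python
import os
from multiprocessing.pool import ThreadPool

# Assumed variable name (check the linked docTR page before relying on it).
ENV_FLAG = "DOCTR_MULTIPROCESSING_DISABLE"


def map_maybe_threaded(func, seq, threads=4):
    """Hypothetical helper: apply func to seq, sequentially if the flag is set."""
    # The accepted truthy spellings below are an assumption for this sketch.
    disabled = os.environ.get(ENV_FLAG, "").upper() in {"1", "TRUE", "YES"}
    if threads < 2 or disabled:
        # Sequential path: no worker threads are created.
        return [func(x) for x in seq]
    # Threaded path: results come back in input order.
    with ThreadPool(threads) as tp:
        return tp.map(func, seq)


# Set the flag before any data loading starts.
os.environ[ENV_FLAG] = "TRUE"
squares = map_maybe_threaded(lambda x: x * x, range(5))
```

In a Colab notebook, setting the variable with `os.environ` in a cell that runs before the `!python ... train_tensorflow.py` line should be enough, since processes started from the notebook normally inherit its environment.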

felixdittrich92 commented 6 months ago

Hi @fosple 👋 Has that solved your problem? :)

fosple commented 6 months ago

@felixdittrich92 Thanks for the super fast reply :) In the end I used the PyTorch version, as it worked out of the box for me. But I can check in the next few days whether your workaround solves this specific problem.

felixdittrich92 commented 6 months ago

@fosple Great, so I think we can close this :)

Feel free to reopen if anything doesn't work 👍
