pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

IndexError: Caught IndexError in DataLoader worker process 0. IndexError: list index out of range #8136

Closed. Egorundel closed this issue 10 months ago.

Egorundel commented 10 months ago

🐛 Describe the bug

Hello, I get the errors IndexError: Caught IndexError in DataLoader worker process 0. and IndexError: list index out of range
when I run the command: torchrun --nproc_per_node=1 train.py --dataset coco --model retinanet_resnet50_fpn --epochs 3 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --weights-backbone ResNet50_Weights.IMAGENET1K_V1

(RetinaNet) egorundel@egorundel:~/projects/vision/references/detection$ torchrun --nproc_per_node=1 train.py --dataset coco --model retinanet_resnet50_fpn --epochs 3 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --weights-backbone ResNet50_Weights.IMAGENET1K_V1
| distributed init (rank 0): env://
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Namespace(amp=False, aspect_ratio_group_factor=3, backend='pil', batch_size=2, data_augmentation='hflip', data_path='/home/egorundel/data/NAMI_data_coco_without_subfolders/', dataset='coco', device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, epochs=3, gpu=0, lr=0.01, lr_gamma=0.1, lr_scheduler='multisteplr', lr_step_size=8, lr_steps=[16, 22], model='retinanet_resnet50_fpn', momentum=0.9, norm_weight_decay=None, opt='sgd', output_dir='.', print_freq=20, rank=0, resume='', rpn_score_thresh=None, start_epoch=0, sync_bn=False, test_only=False, trainable_backbone_layers=None, use_copypaste=False, use_deterministic_algorithms=False, use_v2=False, weight_decay=0.0001, weights=None, weights_backbone='ResNet50_Weights.IMAGENET1K_V1', workers=4, world_size=1)
Loading data
loading annotations into memory...
Done (t=0.92s)
creating index...
index created!
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [   12 17403  2558  2554    87]
Creating model
[rank0]:[W Utils.hpp:106] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Start training
Traceback (most recent call last):
  File "train.py", line 334, in <module>
    main(args)
  File "train.py", line 309, in main
    train_one_epoch(model, optimizer, data_loader, device, epoch, args.print_freq, scaler)
  File "/home/egorundel/projects/vision/references/detection/engine.py", line 27, in train_one_epoch
    for images, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/egorundel/projects/vision/references/detection/utils.py", line 171, in log_every
    for obj in iterable:
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/_utils.py", line 699, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 399, in __getitems__
    return [self.dataset[self.indices[idx]] for idx in indices]
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 399, in <listcomp>
    return [self.dataset[self.indices[idx]] for idx in indices]
  File "/home/egorundel/projects/vision/references/detection/coco_utils.py", line 196, in __getitem__
    img, target = self._transforms(img, target)
  File "/home/egorundel/projects/vision/references/detection/transforms.py", line 26, in __call__
    image, target = t(image, target)
  File "/home/egorundel/projects/vision/references/detection/coco_utils.py", line 49, in __call__
    masks = convert_coco_poly_to_mask(segmentations, h, w)
  File "/home/egorundel/projects/vision/references/detection/coco_utils.py", line 14, in convert_coco_poly_to_mask
    rles = coco_mask.frPyObjects(polygons, height, width)
  File "pycocotools/_mask.pyx", line 294, in pycocotools._mask.frPyObjects
IndexError: list index out of range

[2023-11-30 09:28:52,627] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 9114) of binary: /home/egorundel/venvs/RetinaNet/bin/python3
Traceback (most recent call last):
  File "/home/egorundel/venvs/RetinaNet/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/egorundel/venvs/RetinaNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-30_09:28:52
  host      : egorundel-B560M-H
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9114)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

What could be causing this, and how can I fix it?

Versions

OS: Ubuntu 20.04

pip packages: torch 2.2.0.dev20231128+cu118, torchaudio 2.2.0.dev20231128+cu118, torchvision 0.17.0.dev20231128+cu118, PyYAML 6.0.1, pycocotools 2.0.7, Pillow 9.3.0, matplotlib 3.7.4

Hardware: RTX3060

NicolasHug commented 10 months ago

Likely an issue with your underlying dataset. Sorry, there's just not a lot of info for us to help here.

Egorundel commented 10 months ago

@NicolasHug What additional information do you need from me to narrow this down?

pmeier commented 10 months ago

The traceback points to pycocotools:

  File "pycocotools/_mask.pyx", line 294, in pycocotools._mask.frPyObjects
IndexError: list index out of range

So most likely one of the encoded masks in your dataset is broken and cannot be decoded. You need to figure out which sample is causing this and fix your data.
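For context, a common way to hit exactly this IndexError is an annotation whose segmentation field is an empty list: frPyObjects inspects the first element of the list to decide what format it was given, so an empty list fails immediately. A minimal, hypothetical reproduction (the image size is arbitrary):

from pycocotools import mask as coco_mask

# An annotation with an empty "segmentation" list reproduces the failure:
# frPyObjects looks at the first element to decide the input format.
empty_segmentation = []
coco_mask.frPyObjects(empty_segmentation, 480, 640)  # IndexError: list index out of range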

Egorundel commented 10 months ago

@pmeier Hello! How can I find the broken data? Do you perhaps have a script to check it?

pmeier commented 10 months ago

Nope, you gotta write one yourself. You need to iterate over your dataset and invoke

https://github.com/pytorch/vision/blob/30397d910519f87ff44f8afe4f68da9db54b2eb3/references/detection/coco_utils.py#L11

on all segmentations and see for which samples it breaks.
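A minimal sketch of such a checker, assuming a standard COCO-format annotation JSON (the path below is a placeholder you would need to adjust), could look like this:

import json

from pycocotools import mask as coco_mask

ANN_FILE = "path/to/instances_train.json"  # placeholder path, adjust to your dataset

with open(ANN_FILE) as f:
    coco = json.load(f)

# image id -> (height, width), mirroring what convert_coco_poly_to_mask receives
sizes = {img["id"]: (img["height"], img["width"]) for img in coco["images"]}

for ann in coco["annotations"]:
    segm = ann.get("segmentation")
    h, w = sizes[ann["image_id"]]
    try:
        # Polygon segmentations are lists of coordinate lists; RLE segmentations
        # are dicts and take a different code path, so only polygons are checked here.
        if isinstance(segm, list):
            coco_mask.frPyObjects(segm, h, w)
    except Exception as exc:
        print(f"annotation {ann['id']} (image {ann['image_id']}) is broken: {exc!r}")

Any annotation this reports (typically one with an empty or otherwise malformed segmentation list) would need to be fixed or removed from the JSON before training.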