voldemortX / pytorch-auto-drive

PytorchAutoDrive: Segmentation models (ERFNet, ENet, DeepLab, FCN...) and Lane detection models (SCNN, RESA, LSTR, LaneATT, BézierLaneNet...) based on PyTorch with fast training, visualization, benchmarking & deployment help
BSD 3-Clause "New" or "Revised" License

RuntimeError: received 0 items of ancdata #138

Closed GCQi closed 1 year ago

GCQi commented 1 year ago

When I train LSTR on TuSimple with the command python main_landet.py --train --config ./configs/lane_detection/lstr/resnet18s_tusimple.py --mixed-precision, it runs for several epochs and then randomly fails with the error RuntimeError: received 0 items of ancdata

The error message is:

Traceback (most recent call last):
  File "main_landet.py", line 80, in <module>
    runner.run()
  File "/data/123/gcq/LaneDetection/pytorch-auto-drive/utils/runners/lane_det_trainer.py", line 35, in run
    for i, data in enumerate(self.dataloader, 0):
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
    fd = df.detach()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
GCQi commented 1 year ago

Besides, it also shows me the following warning:

/data/123/gcq/LaneDetection/pytorch-auto-drive/utils/datasets/utils.py:30: UserWarning: An output with one or more elements was resized since it had shape [88473600], which does not match the required output shape [128, 3, 360, 640]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552411/work/aten/src/ATen/native/Resize.cpp:17.)

GCQi commented 1 year ago

Also, I changed the batch size to 128; maybe that caused the error?

voldemortX commented 1 year ago

Also, I changed the batch size to 128; maybe that caused the error?

Yes, that is probably the reason. Scale it down and see if the issue persists? Usually, this loading error occurs when parallel data loading is too heavy for your system.

GCQi commented 1 year ago

Also, I changed the batch size to 128; maybe that caused the error?

Yes, that is probably the reason. Scale it down and see if the issue persists? Usually, this loading error occurs when parallel data loading is too heavy for your system.

Now I have changed it to 64, and the error has not occurred so far.

GCQi commented 1 year ago

Unfortunately, even with the batch size still set to 64 and the workers set to 32, the error RuntimeError: received 0 items of ancdata appeared again.

GCQi commented 1 year ago

Besides, the train_augmentation is:

train_augmentation = dict(
    name='Compose',
    transforms=[
        dict(
            name='Resize',
            size_image=(360, 640),
            size_label=(360, 640)
        ),
        dict(
            name='RandomHorizontalFlip',
            flip_prob=0.5
        ),
        dict(
            name='RandomRotation',
            degrees=10
        ),
        dict(
            name='ColorJitter',
            brightness=0.4,
            contrast=0.4,
            saturation=0.4,
            hue=0.2
        ),
        dict(
            name='ToTensor'
        ),
        dict(
            name='RandomLighting',
            mean=0.0,
            std=0.1,
            eigen_value=[0.00341571, 0.01817699, 0.2141788],
            eigen_vector=[
                [0.41340352, -0.69563484, -0.58752847],
                [-0.81221408, 0.00994535, -0.5832747],
                [0.41158938, 0.71832671, -0.56089297]
            ]
        ),
        dict(
            name='Normalize',
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
            normalize_target=True
        )
    ]
)
GCQi commented 1 year ago

Have you ever encountered this problem before? I cannot get anything useful out of the error message.

voldemortX commented 1 year ago

@GCQi In my experience, this problem comes with heavy data loading (relative to your hardware). A large batch size, more workers, and a long training schedule all increase the probability of encountering this error, which can happen halfway through training. You may find that my default batch size is kept at 20 for this very reason.
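
For example, dialing the data loading back down would look roughly like this in the training config (a minimal sketch; the batch_size and workers field names are assumed to mirror the ones already used in configs such as configs/lane_detection/lstr/resnet18s_tusimple.py):

train = dict(
    batch_size=20,  # assumed field name; smaller batches put less pressure on the loader
    workers=8,      # assumed field name; fewer worker processes pass fewer file descriptors around
    # ... keep the remaining training options unchanged
)

With values in this range, each fetch shares far fewer tensors between processes, which is usually enough to stay under the file descriptor limit for a whole training run.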

voldemortX commented 1 year ago

Sometimes the file_system strategy could help, but it has a memory leak issue of its own.

https://github.com/voldemortX/pytorch-auto-drive/blob/f2615da8063640e8c15029b52eeb026d6808ee7d/main_landet.py#L9
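
For reference, switching the strategy is a one-liner executed before any dataloaders are created (a minimal sketch using the standard torch.multiprocessing API; whether to enable it involves the trade-off mentioned above):

import torch.multiprocessing

# Use file-system-backed shared memory instead of passing file descriptors
# over sockets (the Linux default), which is what runs out and triggers
# "received 0 items of ancdata". Caveat: this strategy can leak shared-memory
# files if worker processes die unexpectedly.
torch.multiprocessing.set_sharing_strategy('file_system')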

GCQi commented 1 year ago

OK, thanks for your help!! This open-source framework is pretty good; thanks for your contribution and great work.