ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Two errors when training a custom dataset in the Jupyter Notebook tutorial.ipynb #3260

Closed ghost closed 3 years ago

ghost commented 3 years ago

❔Question

Hello, when I start training in tutorial.ipynb I get a couple of errors in the output and the training stops.

# Train YOLOv5s on the custom schwarzwild dataset for 3 epochs
!python train.py --img 640 --batch 16 --epochs 3 --data schwarzwild.yaml --weights yolov5s.pt --nosave --cache

You can see the output below. If I understand it correctly, it has something to do with memory. What is the solution here?

Thank you

Additional context

github: skipping check (not a git repository)

YOLOv5 πŸš€ 2021-5-18 torch 1.8.1+cu102 CUDA:0 (Tesla K80, 11441.1875MB)
                                     CUDA:1 (Tesla K80, 11441.1875MB)
                                     CUDA:2 (Tesla K80, 11441.1875MB)
                                     CUDA:3 (Tesla K80, 11441.1875MB)

Namespace(adam=False, artifact_alias='latest', batch_size=16, bbox_interval=-1, bucket='', cache_images=True, cfg='', data='./data/schwarzwild.yaml', device='', entity=None, epochs=3, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='exp', noautoanchor=False, nosave=True, notest=False, project='test', quad=False, rect=False, resume=False, save_dir='test/exp5', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=16, upload_dataset=False, weights='yolov5s.pt', workers=8, world_size=1)
tensorboard: Start with 'tensorboard --logdir test', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0

Overriding model.yaml nc=80 with nc=5

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    156928  models.common.C3                        [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     26970  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7074330 parameters, 7074330 gradients

Transferred 356/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
Optimizer groups: 62 .bias, 62 conv.weight, 59 other
train: Scanning '../train_data_100/labels/train.cache' images and labels... 80 f
train: Caching images (0.1GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 80/80 [00:01<00:00, 53.18it/s]
val: Scanning '../train_data_100/labels/val.cache' images and labels... 19 found
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1872_1620762766>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1952_3660469347>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1904_2151056991>
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1936_1188487916>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1968_2006413901>

val: Caching images (0.0GB):   5%|β–Š              | 1/19 [00:00<00:04,  3.82it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1936_827499676>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1904_836034397>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1968_180740129>
val: Caching images (0.0GB):  42%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž        | 8/19 [00:00<00:00, 19.09it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 851, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 542, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 199, in train
    testloader = create_dataloader(test_path, imgsz_test, batch_size * 2, gs, opt,  # testloader
  File "/home/jovyan/lese1023/Dokumente und Einstellungen/yolov5-master/utils/datasets.py", line 63, in create_dataloader
    dataset = LoadImagesAndLabels(path, imgsz, batch_size,
  File "/home/jovyan/lese1023/Dokumente und Einstellungen/yolov5-master/utils/datasets.py", line 449, in __init__
    for i, x in pbar:
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1176, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 856, in next
    self._cond.wait(timeout)
  File "/opt/conda/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1920) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
glenn-jocher commented 3 years ago

@cyberFoxi your machine is out of memory.
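More specifically, the RuntimeError above points to insufficient shared memory (/dev/shm) rather than GPU memory. A minimal workaround sketch, assuming the notebook runs inside a Docker or JupyterHub container with the small default /dev/shm, is to disable the dataloader workers or to restart the container with more shared memory (the container image and sizes below are illustrative, not taken from this thread):

# Option 1: run the dataloader in the main process so no shared memory is needed
!python train.py --img 640 --batch 16 --epochs 3 --data schwarzwild.yaml --weights yolov5s.pt --nosave --cache --workers 0

# Option 2: restart the container with host IPC or a larger /dev/shm (run outside the notebook)
docker run --ipc=host --gpus all -it ultralytics/yolov5:latest
docker run --shm-size=8g --gpus all -it ultralytics/yolov5:latest

Either approach avoids the bus error raised when the PyTorch dataloader workers try to share cached image tensors through /dev/shm.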

github-actions[bot] commented 3 years ago

πŸ‘‹ Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐!