ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

multi-gpu: "nan loss detected, ending training" #434

Closed: zhzgithub closed this issue 5 years ago

zhzgithub commented 5 years ago

When I use one GPU, training works fine, but when I use 2 GPUs the following happens:

Namespace(accumulate=2, batch_size=4, bucket='', cfg='cfg/yolov3-spp.cfg', data='data/coco.data', epochs=273, evolve=False, gpus='0,1', img_size=416, img_weights=False, multi_scale=True, nosave=False, notest=False, num_workers=32, rect=False, resume=False, transfer=False, xywh=False)
Using CUDA Apex device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11178MB)
                device1 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11178MB)

Using multi-scale 320 - 608
Reading labels (117264 found, 0 missing, 0 empty for 117264 images): 100%|███████████████████████████████████████| 117264/117264 [00:11<00:00, 10167.36it/s]
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

     Epoch   gpu_mem   GIoU/xy        wh       obj       cls     total   targets  img_size
  0%|          | 0/29316 [00:00<?, ?it/s]
WARNING: nan loss detected, ending training
Exception ignored in: <bound method tqdm.__del__ of   0%|          | 0/29316 [00:21<?, ?it/s]>
Traceback (most recent call last):
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 885, in __del__
    self.close()
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1090, in close
    self._decr_instances(self)
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
    cls.monitor.exit()
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

How can I solve this problem? Help, please!

glenn-jocher commented 5 years ago

I’m not sure, but 32 workers is way too many if your batch size is 4.
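
As a rough illustration of that point (this is a minimal, self-contained sketch, not the repo's actual train.py or its dataloader setup; the dataset here is a dummy placeholder), one common heuristic is to cap num_workers at roughly the batch size or the number of available CPU cores, whichever is smaller:

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in dataset; the real training script builds its own image/label dataset.
dataset = TensorDataset(torch.randn(64, 3, 416, 416), torch.zeros(64, dtype=torch.long))

batch_size = 4
# Heuristic: more workers than batch items (or CPU cores) mostly adds overhead,
# since each worker process prefetches whole batches and competes for CPU and RAM.
num_workers = min(batch_size, os.cpu_count() or 1)

loader = DataLoader(dataset,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    shuffle=True,
                    pin_memory=True)

for imgs, targets in loader:
    pass  # training step would go here

With batch_size=4, a handful of workers is usually plenty; 32 worker processes each prefetch their own batches, which inflates CPU and host-memory use without making the GPU step any faster.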