When I use one GPU, it works well, but when I use 2 GPUs the following error occurs:
```
Namespace(accumulate=2, batch_size=4, bucket='', cfg='cfg/yolov3-spp.cfg', data='data/coco.data', epochs=273, evolve=False, gpus='0,1', img_size=416, img_weights=False, multi_scale=True, nosave=False, notest=False, num_workers=32, rect=False, resume=False, transfer=False, xywh=False)
Using CUDA Apex
    device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11178MB)
    device1 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11178MB)

Using multi-scale 320 - 608
Reading labels (117264 found, 0 missing, 0 empty for 117264 images): 100%|███████████████| 117264/117264 [00:11<00:00, 10167.36it/s]
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

     Epoch   gpu_mem   GIoU/xy        wh       obj       cls     total   targets  img_size
  0%|          | 0/29316 [00:00<?, ?it/s]
WARNING: nan loss detected, ending training
Exception ignored in: <bound method tqdm.__del__ of   0%|          | 0/29316 [00:21<?, ?it/s]>
Traceback (most recent call last):
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 885, in __del__
    self.close()
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1090, in close
    self._decr_instances(self)
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
    cls.monitor.exit()
  File "/home/cidi/.local/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
```
How can I solve this problem? Help, please!
I'm not sure, but 32 workers is far too many for a batch size of 4; try reducing num_workers.
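If it helps, here is a minimal sketch of what I mean by capping the DataLoader worker count relative to the batch size and the available CPUs. This is just an illustrative heuristic with a dummy dataset, not the repo's actual code:

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 4  # matches the Namespace in the log above

# Don't spawn more workers than the batch size or the machine's CPU count.
# (Illustrative heuristic, not the repo's actual logic.)
num_workers = min(batch_size, os.cpu_count() or 1)

# Dummy dataset just to make the example self-contained.
dataset = TensorDataset(torch.zeros(100, 3, 416, 416))
loader = DataLoader(dataset, batch_size=batch_size,
                    num_workers=num_workers, pin_memory=True)
```

In your case that would mean passing something like 4 workers instead of 32 (whatever the corresponding command-line option is in this repo) and seeing whether the multi-GPU run still produces the nan loss.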