lzx1413 / PytorchSSD

PyTorch version of SSD and its enhanced variants such as RFBSSD, FSSD and RefineDet
MIT License

0.4 branch, RefineDet training: OC & OL loss NaN and memory error #60

Open liuruijin17 opened 5 years ago

liuruijin17 commented 5 years ago

Thanks for your code! When I train with the 0.4 branch and refinedet_train_test.py (my platform is Ubuntu 16.04, Python 3.6, PyTorch 0.4.1, 1080 Ti), I get the following bug log WHEN I set batch size >= 8:

Loading base network...
Initializing weights...
Loading Dataset...
Training Refine_vgg on VOC0712
Epoch:1 || epochiter: 0/2068|| Total iter 0 || AL: 0.2553 AC: 0.4472 OL: 0.2597 OC: 1.2308||Batch time: 5.4207 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 10/2068|| Total iter 10 || AL: 3.1427 AC: 4.1780 OL: 3.0403 OC: 12.2006||Batch time: 0.1865 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 20/2068|| Total iter 20 || AL: 3.0641 AC: 3.2415 OL: 2.9984 OC: 12.0661||Batch time: 0.1881 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 30/2068|| Total iter 30 || AL: 2.9390 AC: 4.4274 OL: 2.9430 OC: 11.8935||Batch time: 0.1895 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 40/2068|| Total iter 40 || AL: 45.4877 AC: 101.7261 OL: nan OC: 21.0359||Batch time: 0.1875 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 50/2068|| Total iter 50 || AL: 3.1931 AC: 5.3219 OL: nan OC: nan||Batch time: 0.1846 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 60/2068|| Total iter 60 || AL: 2.8870 AC: 2.6611 OL: nan OC: nan||Batch time: 0.1817 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 70/2068|| Total iter 70 || AL: 3.0326 AC: 2.6308 OL: nan OC: nan||Batch time: 0.1834 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 80/2068|| Total iter 80 || AL: 3.0734 AC: 2.6007 OL: nan OC: nan||Batch time: 0.1819 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 90/2068|| Total iter 90 || AL: 3.1231 AC: 2.5809 OL: nan OC: nan||Batch time: 0.1827 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 100/2068|| Total iter 100 || AL: 2.7920 AC: 2.5576 OL: nan OC: nan||Batch time: 0.1796 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 110/2068|| Total iter 110 || AL: 2.9646 AC: 2.5389 OL: nan OC: nan||Batch time: 0.1817 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 120/2068|| Total iter 120 || AL: 3.0986 AC: 2.5231 OL: nan OC: nan||Batch time: 0.1818 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 130/2068|| Total iter 130 || AL: 2.8665 AC: 2.5052 OL: nan OC: nan||Batch time: 0.1830 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 140/2068|| Total iter 140 || AL: 2.9167 AC: 2.4903 OL: nan OC: nan||Batch time: 0.1809 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 150/2068|| Total iter 150 || AL: 2.8481 AC: 2.4771 OL: nan OC: nan||Batch time: 0.1809 sec. ||LR: 0.00200000
Epoch:1 || epochiter: 160/2068|| Total iter 160 || AL: 2.8456 AC: 2.4619 OL: nan OC: nan||Batch time: 0.1821 sec. ||LR: 0.00200000
Traceback (most recent call last):
  File "refinedet_train_test.py", line 416, in <module>
    train()
  File "refinedet_train_test.py", line 280, in train
    odm_loss_l, odm_loss_c = odm_criterion((odm_loc,odm_conf),priors,targets,(arm_loc,arm_conf),False)
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vision01/Workspace/Python/Detection/PytorchSSD-0.4/layers/modules/refine_multibox_loss.py", line 103, in forward
    _,loss_idx = loss_c.sort(1, descending=True)
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fc21c157a90>>
Traceback (most recent call last):
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/home/vision01/anaconda3/envs/python3.6pytorch0.4.1/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

Q(1): It does not seem to be a GPU memory overflow, because my GPU memory usage is only half of the 1080 Ti's memory.
Q(2): The losses look strange: AL and AC start small, suddenly become large, then settle down, while OL and OC start small and eventually become NaN.

WHEN I set batch size < 8, Q(1) is gone, but Q(2) still exists.
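Not a confirmed fix for this issue, but this pair of symptoms often points at either out-of-range ground-truth labels reaching the ODM loss (which can surface as an illegal memory access inside the sort/gather of hard negative mining) or an exploding loss that turns into NaN. Below is a minimal debugging sketch under those assumptions; the names `targets`, `net`, `optimizer` and `loss` are illustrative stand-ins for the variables in the training loop, not code taken from the repo.

```python
import os
import torch

# Run with synchronous CUDA kernels so the traceback points at the op that
# actually faults, rather than a later sort() call. (Assumption: this is set
# before any CUDA work happens.)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_targets(targets, num_classes):
    """Sanity-check ground-truth boxes/labels before they reach the loss.

    Assumes each element of `targets` is an [num_objs, 5] tensor laid out as
    [xmin, ymin, xmax, ymax, label]. Labels outside [0, num_classes) are a
    classic cause of illegal memory access inside multibox-loss gather/sort.
    """
    for i, t in enumerate(targets):
        if torch.isnan(t).any():
            raise ValueError("image %d: NaN in targets" % i)
        labels = t[:, -1]
        if labels.min().item() < 0 or labels.max().item() >= num_classes:
            raise ValueError("image %d: label out of range [0, %d)" % (i, num_classes))


# Inside the training loop, after loss.backward() and before optimizer.step(),
# clipping gradients keeps one bad batch from driving OL/OC to NaN:
#   torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
#
# Skipping an already-broken batch instead of stepping also helps:
#   if torch.isnan(loss).item():
#       optimizer.zero_grad()
#       continue
```

Lowering the initial learning rate (the log above shows 0.002) is another common mitigation for the early spike in AL/AC that precedes OL/OC going to NaN.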

WangTianYuan commented 5 years ago

@liuruijin17 Hello, have you solved this problem? How?

liuruijin17 commented 5 years ago

@WangTianYuan I haven't. I am using a TensorFlow implementation now.