jwyang / fpn.pytorch

Pytorch implementation of Feature Pyramid Network (FPN) for Object Detection
MIT License

Crashes after trainval_net.py finishes its first training iteration #11

Closed: destinyzs closed this issue 6 years ago

destinyzs commented 6 years ago

I am training on my own dataset using FPN. When I set num_workers > 0 (for example, 4), training crashes after the first iteration, but it runs fine with num_workers = 0. My environment is as follows:

OS: Ubuntu 16.04
PyTorch version: 0.3.1
Python version: 2.7
CUDA version: 8.0
GPU model: Tesla P40

[session 3][epoch  1][iter    0] loss: 4.3830, lr: 1.00e-03
            fg/bg=(115/397), time cost: 5.939724
            rpn_cls: 0.6945, rpn_box: 1.2795, rcnn_cls: 2.3964, rcnn_box 0.0127
Traceback (most recent call last):
  File "trainval_net.py", line 339, in <module>
    data = data_iter.next()
  File "/root/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 204, in __next__
    idx, batch = self.data_queue.get()
  File "/root/anaconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/root/anaconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/root/anaconda2/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/root/anaconda2/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/root/anaconda2/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/root/anaconda2/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/root/anaconda2/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/root/anaconda2/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/root/anaconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

I saw the same problem in https://github.com/pytorch/pytorch/issues/1355, but that reporter's Python version is 3.x. I tried the solution from there, and it does not work for me. @jwyang

destinyzs commented 6 years ago

I also tried the solution in https://github.com/jwyang/faster-rcnn.pytorch/issues/150, but that reporter is also on Python 3.x. In Python 2.7, multiprocessing has no set_start_method.
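For readers hitting the same wall, here is a minimal sketch of why that fix does not carry over to Python 2.7. The helper name `try_set_start_method` is hypothetical, and the choice of the "spawn" method is an assumption, since the thread does not say which start method the linked issue used. It only shows that `multiprocessing.set_start_method` exists on Python 3.4+ and is missing on 2.7.

```python
import multiprocessing


def try_set_start_method(method="spawn"):
    """Set the multiprocessing start method if the API exists.

    Returns True when the start method was set (Python 3.4+),
    False when set_start_method is unavailable (e.g. Python 2.7).
    The helper name and the default "spawn" are illustrative assumptions.
    """
    setter = getattr(multiprocessing, "set_start_method", None)
    if setter is None:
        # Python 2.7: the API does not exist, so this workaround cannot apply.
        return False
    # force=True avoids a RuntimeError if a start method was already chosen.
    setter(method, force=True)
    return True


if __name__ == "__main__":
    print("start method set:", try_set_start_method())
```

On Python 2.7 the helper simply returns False, so a different workaround is needed there.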

neulrl commented 6 years ago

You can try the other methods mentioned in https://github.com/pytorch/pytorch/issues/1355, like:

The first method solved my problem too, but it makes the code run slower.

destinyzs commented 6 years ago

Thanks! Actually, I solved it by calling cv2.setNumThreads(0). Anyway, thanks for your response. @neulrl
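A minimal sketch of that workaround, assuming the call is placed near the top of the training script, before any DataLoader with num_workers > 0 is created. The helper name `disable_opencv_threads` is hypothetical and not part of this repository; only `cv2.setNumThreads` is an actual OpenCV call.

```python
def disable_opencv_threads():
    """Turn off OpenCV's internal threading for this process.

    Returns True if cv2 was importable and the setting was applied,
    False if OpenCV is not installed. Hypothetical helper for illustration.
    """
    try:
        import cv2
    except ImportError:
        return False
    # 0 disables OpenCV's own thread pool, so image operations run sequentially.
    cv2.setNumThreads(0)
    return True


# Assumption: call this once, before the DataLoader workers are started.
opencv_found = disable_opencv_threads()
print("OpenCV threading disabled:", opencv_found)
```

With this in place, the DataLoader can be built as before with num_workers > 0. Whether it resolves the crash in other environments is not confirmed by this thread beyond the setup reported above.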