Broken pipe error - Githubissues

amiltonwong commented 5 years ago

Hi, all,

I got the following broken pipe error when training reached the final epoch (epoch=499):

in epoch 498
max_epoch 500
**** EPOCH 498 ****
2019-02-26 05:38:35.735413
Progress: [##########] 100%mean loss: 0.082965
Overall accuracy : 0.991698
Average IoU : 0.963038
IoU of man-made terrain : 0.970558
IoU of natural terrain : 0.981679
IoU of high vegetation : 0.994769
IoU of low vegetation : 0.937876
IoU of buildings : 0.993800
IoU of hard scape : 0.939500
IoU of scanning artifact : 0.923614
IoU of cars : 0.962506
in epoch 499
max_epoch 500
**** EPOCH 499 ****
2019-02-26 05:39:40.278005
Progress: [##########] 100%mean loss: 0.077413
Overall accuracy : 0.992196
Average IoU : 0.962089
IoU of man-made terrain : 0.971449
IoU of natural terrain : 0.982647
IoU of high vegetation : 0.996347
IoU of low vegetation : 0.935048
IoU of buildings : 0.994343
IoU of hard scape : 0.937210
IoU of scanning artifact : 0.921430
IoU of cars : 0.958242
Process ForkPoolWorker-1:1:
Traceback (most recent call last):
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
(tf) root@milton-ThinkCentre-M93p:/data/code8/Open3D-PointNet2-Semantic3D#

Is something configured incorrectly, or can this error simply be ignored?

My environment is: TensorFlow 1.12, CUDA 9.0, cuDNN 7.5

yxlao commented 5 years ago

Yeah, this is a known issue, possibly caused by a subprocess not being terminated properly. The results should be fine, though, since it only happens after the final iteration.

it-kaola commented 3 years ago

> Yeah, this is a known issue, possibly caused by a subprocess not being terminated properly. The results should be fine, though, since it only happens after the final iteration.

I ran into the same problem, but in my case it happened at epoch 25. How can I solve it?

fibo11235 commented 2 years ago

One solution may be to use ThreadPool instead of Pool in the training script. So, for the fill_queues function in train.py:

from multiprocessing.pool import ThreadPool
import multiprocessing as mp

###### Portion with fill_queues function
def fill_queues(
    stack_train, stack_validation, num_train_batches, num_validation_batches
):
    """
    Args:
        stack_train: mp.Queue to be filled asynchronously
        stack_validation: mp.Queue to be filled asynchronously
        num_train_batches: total number of training batches
        num_validation_batches: total number of validation batches
    """

    # Use worker threads instead of worker processes; ThreadPool exposes the same API as Pool
    pool = ThreadPool(mp.cpu_count())
###### Fill in remaining code

See if that works.
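
To make the idea concrete, here is a minimal, self-contained sketch of filling a queue from a ThreadPool and shutting the pool down explicitly. It is not the repository's code: make_batch and fill_queue are hypothetical names, the batch contents are placeholders, and it uses a plain queue.Queue rather than the mp.Queue mentioned in the docstring above.

```python
import queue
from multiprocessing.pool import ThreadPool


def make_batch(index):
    # Hypothetical stand-in for the real batch-loading work.
    return [index] * 3


def fill_queue(out_queue, num_batches, num_workers=4):
    """Fill out_queue with num_batches batches produced by worker threads."""
    pool = ThreadPool(num_workers)
    try:
        # imap yields results in submission order as they become available.
        for batch in pool.imap(make_batch, range(num_batches)):
            out_queue.put(batch)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker threads to exit


if __name__ == "__main__":
    q = queue.Queue()
    fill_queue(q, num_batches=5)
    print(q.qsize())  # 5
```

Keep in mind that threads share one interpreter, so pure-Python, CPU-bound batch preparation will not run in parallel the way it does with Pool. Whether that matters here depends on how much of the batch loading happens inside NumPy or I/O calls.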

isl-org / Open3D-PointNet2-Semantic3D

Broken pipe error #41