Open amiltonwong opened 5 years ago
Yeah, this is a known issue, potentially due to a subprocess not being properly terminated. The results should be fine though, as it only happens after the final iteration.
I also encountered the same problem, but it happened at epoch 25. How can I solve it?
One solution may be to use ThreadPool instead of Pool in the training script. So, for the fill_queues function in train.py:
    from multiprocessing.pool import ThreadPool
    import multiprocessing as mp

    ###### Portion with fill_queues function
    def fill_queues(
        stack_train, stack_validation, num_train_batches, num_validation_batches
    ):
        """
        Args:
            stack_train: mp.Queue to be filled asynchronously
            stack_validation: mp.Queue to be filled asynchronously
            num_train_batches: total number of training batches
            num_validation_batches: total number of validation batches
        """
        # Threads instead of worker processes, one per CPU core
        pool = ThreadPool(mp.cpu_count())
        ###### Fill in remaining code
See if that works.
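In case it helps, here is a minimal, self-contained sketch of the same idea outside of train.py. It is not the repository's code: `load_batch` and `fill_queue` are hypothetical names made up for illustration, and it assumes a plain `queue.Queue` is enough once the workers are threads in the same process. The part that matters for the broken pipe is the explicit `close()` and `join()` at the end, so no worker is left running when the script exits.

```python
import queue
from multiprocessing.pool import ThreadPool


def load_batch(index):
    # Hypothetical placeholder for the real batch-loading code.
    return [index] * 4


def fill_queue(batch_queue, num_batches, num_workers=4):
    """Put num_batches batches on batch_queue using worker threads."""
    pool = ThreadPool(num_workers)
    try:
        # imap yields results in submission order as the threads finish them.
        for batch in pool.imap(load_batch, range(num_batches)):
            batch_queue.put(batch)
    finally:
        # Stop accepting work, then wait for the worker threads to exit.
        pool.close()
        pool.join()


if __name__ == "__main__":
    train_queue = queue.Queue()
    fill_queue(train_queue, num_batches=5)
    print(train_queue.qsize())
```

Since threads share the parent's memory, there are no inter-process pipes to break on shutdown. The trade-off is that CPU-bound batch preparation in pure Python may run slower under the GIL than it would with Pool.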
Hi, all,
I got a broken pipe error when the training reached the final epoch (epoch=499). Is something misconfigured, or can this error simply be ignored?
My environment is: tensorflow 1.12, cuda 9.0 + cudnn 7.5