pudae / tensorflow-densenet

Tensorflow-DenseNet with ImageNet Pretrained Models
Apache License 2.0

How to control the thread number on Tensorflow? #7

Closed ghost closed 6 years ago

ghost commented 6 years ago

Hi @pudae ,

Do we have any options to control the number of threads in TF-Slim, in both the training and evaluation processes?

Specifically, I use this network for my classification problem. I changed the evaluation part so that training and evaluation run in parallel, like this code. I can run it on my own CPU without any problem, but I can't execute it on a supercomputer. The failure seems related to the very large number of threads that Tensorflow creates: if the number of threads exceeds the maximum pre-set in SLURM (= 28), the job fails because it can no longer create new threads and ends up with the error "resource temporarily unavailable".

This error occurs when the code tries to restore parameters from checkpoints. If there is no limitation on the number of threads (like on my PC), it works fine:

INFO:tensorflow:Restoring parameters from ./model.ckpt-0
INFO:tensorflow:Starting evaluation at
I tensorflow/core/kernels/logging_ops.cc:79] eval/Accuracy[0]
I tensorflow/core/kernels/logging_ops.cc:79] eval/Recall_5[0]
INFO:tensorflow:Evaluation [1/60]

However, when there is a limitation on the number of threads (like SLURM job submission on supercomputers) we get:

INFO:tensorflow:Restoring parameters from ./model.ckpt-0
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

I tried to limit the number of CPU threads used by Tensorflow to 1 by creating a config like this:

    FLAGS.num_preprocessing_threads = 1
    config = tf.ConfigProto()
    config.intra_op_parallelism_threads = FLAGS.num_preprocessing_threads
    config.inter_op_parallelism_threads = FLAGS.num_preprocessing_threads

    slim.evaluation.evaluation_loop(
        master=FLAGS.master,
        checkpoint_path=each_ckpt,
        logdir=FLAGS.eval_dir,
        num_evals=num_batches,
        eval_op=list(names_to_updates.values()) + print_ops,
        variables_to_restore=variables_to_restore,
        session_config=config)
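
As a standalone sanity check (a minimal sketch with a toy graph, assuming plain TensorFlow 1.x; not part of my actual script), the same two fields can be passed to a directly created session, where they only size the op-execution thread pools:

    import tensorflow as tf

    # Toy graph, purely illustrative.
    a = tf.random_normal([512, 512])
    b = tf.random_normal([512, 512])
    c = tf.matmul(a, b)

    # These fields cap the session's intra-op and inter-op execution
    # thread pools at one thread each; they say nothing about the
    # threads started by the input pipeline.
    config = tf.ConfigProto(intra_op_parallelism_threads=1,
                            inter_op_parallelism_threads=1)
    with tf.Session(config=config) as sess:
        sess.run(c)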

But unfortunately, that didn't help. In my opinion, the main problem is that we are not able to control the number of threads. Although we set it to 1 with various TF options, you can see that the job is still creating many more threads on the node:

    slurm_script─┬─python───128*[{python}]
                 └─python───8*[{python}]

The training script is creating 128 threads and the evaluation script is creating 8 (both numbers vary over time).
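
For reference, a minimal sketch of the kind of queue-based pipeline slim builds (toy tensors, assuming TensorFlow 1.x; not the actual DenseNet input pipeline) shows where some of the extra threads come from: the queue runners start their own Python threads regardless of the intra/inter-op settings, and TensorFlow's C++ runtime adds further OS-level threads that pstree sees but Python does not:

    import threading
    import tensorflow as tf

    # Toy stand-in for slim's queue-based input pipeline.
    values = tf.train.input_producer(tf.constant([1.0, 2.0, 3.0])).dequeue()
    batch = tf.train.batch([values], batch_size=2, num_threads=1)

    config = tf.ConfigProto(intra_op_parallelism_threads=1,
                            inter_op_parallelism_threads=1)
    with tf.Session(config=config) as sess:
        coord = tf.train.Coordinator()
        # Each QueueRunner spawns its own Python threads, independent of
        # the ConfigProto settings above.
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.run(batch)
        for t in threading.enumerate():  # one line per Python-level thread
            print(t)
        coord.request_stop()
        coord.join(threads)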

Any idea on how to control the number of threads would be highly appreciated, because I need to fix this issue urgently.

Ellie

P.S. I'm using Python 2.7.13 and Tensorflow 1.3.0.

pudae commented 6 years ago

Hi~ @Ellie68

Sorry for the late answer. It seems there is no way to reduce the number of threads when using slim. I tested the evaluation script with num_preprocessing_threads=1 and found that slim creates 6 threads for the input pipeline and 1 thread for the event logger.

I haven't used the new Dataset API yet, but I expect it will help solve the issue (a rough sketch follows the thread listing below). Thanks.

<Thread(QueueRunnerThread-parallel_read/filenames-parallel_read/filenames/filenames_EnqueueMany, started daemon 140364409132800)>
<Thread(QueueRunnerThread-parallel_read/filenames-close_on_stop, started daemon 140364417525504)>
<Thread(QueueRunnerThread-parallel_read/common_queue-parallel_read/common_queue_enqueue, started daemon 140364849583872)>
<Thread(QueueRunnerThread-parallel_read/common_queue-close_on_stop, started daemon 140364857976576)>
<Thread(QueueRunnerThread-batch/fifo_queue-batch/fifo_queue_enqueue, started daemon 140364891547392)>
<Thread(QueueRunnerThread-batch/fifo_queue-close_on_stop, started daemon 140364883154688)>
<_EventLoggerThread(Thread-1, started daemon 140364874761984)>
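
A rough sketch of that Dataset-based alternative (the tf.data module in TensorFlow 1.4+; in 1.3 the same classes live under tf.contrib.data, and map takes num_threads instead of num_parallel_calls; synthetic arrays stand in for the real TFRecord pipeline). Here the preprocessing parallelism is explicit and no QueueRunner threads are started at all:

    import numpy as np
    import tensorflow as tf

    # Synthetic images/labels instead of the real TFRecord input.
    images = np.random.rand(32, 64, 64, 3).astype(np.float32)
    labels = np.random.randint(0, 10, size=32).astype(np.int64)

    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    # Preprocessing parallelism is explicit: num_parallel_calls=1 keeps the
    # map stage single-threaded, and no queue runners are ever created.
    dataset = dataset.map(
        lambda img, lbl: (tf.image.per_image_standardization(img), lbl),
        num_parallel_calls=1)
    dataset = dataset.batch(8)
    image_batch, label_batch = dataset.make_one_shot_iterator().get_next()

    config = tf.ConfigProto(intra_op_parallelism_threads=1,
                            inter_op_parallelism_threads=1)
    with tf.Session(config=config) as sess:
        sess.run([image_batch, label_batch])
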
ghost commented 6 years ago

Thanks for your answer @pudae