Hey, I don't really know the answer to your questions, but do you know what is creating these threads? Slim is just a wrapper on TF, so it shouldn't be creating that many threads. My best guess is that it is the data-loading queue runners, so I'd check that first.
Closing this; let me know if you find that the problem is in this tutorial and I should reopen it.
For TensorFlow code, when you add

```python
FLAGS.num_preprocessing_threads = 1
config = tf.ConfigProto()
config.intra_op_parallelism_threads = FLAGS.num_preprocessing_threads
config.inter_op_parallelism_threads = FLAGS.num_preprocessing_threads
```

you can control the number of threads. But in TF-Slim code it doesn't work.
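The config built above never reaches the session that TF-Slim creates internally unless the training loop forwards it. A minimal sketch of forwarding it, assuming the installed tf.contrib.slim version exposes the `session_config` keyword on `slim.learning.train`; the tiny model, logdir, and step count are placeholders, not anything from this issue:

```python
import tensorflow as tf

slim = tf.contrib.slim

# Cap both TensorFlow thread pools at a single thread.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)

# Tiny placeholder model so the sketch is self-contained.
inputs = tf.random_normal([4, 3])
predictions = slim.fully_connected(inputs, 1, activation_fn=None)
loss = tf.reduce_mean(tf.square(predictions))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_op = slim.learning.create_train_op(loss, optimizer)

# TF-Slim creates its own session, so the ConfigProto has to be handed to the
# training loop itself rather than to a tf.Session you construct yourself.
slim.learning.train(
    train_op,
    logdir='/tmp/train_logs',
    number_of_steps=10,
    session_config=config)
```

If `slim.evaluation.evaluation_loop` accepts the same `session_config` keyword in your version, the same config can be passed to the evaluation side as well.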
Hi @mnuke,
Do we have any options to control the number of threads in TF-Slim, in both the training and evaluation processes?
Specifically, I use this network for my classification problem. I changed the evaluation part so that training and evaluation run in parallel, like in your code. I can run it on my own CPU without any problem, but I can't execute it on a supercomputer. It seems to be related to the very large number of threads being created by TensorFlow: if the number of threads exceeds the maximum pre-set in SLURM (= 28), the job fails; since it is unable to create new threads, it ends up with the error "resource temporarily unavailable".
This error occurs when the code tries to restore parameters from checkpoints. If there is no limitation on the number of threads (like on my PC), it works fine:
However, when there is a limit on the number of threads (as with SLURM job submission on supercomputers), we get:
I tried to limit the number of CPU threads used by TensorFlow to 1 by creating a config like this:
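(A sketch of what I tried, reconstructed from the snippet quoted earlier in the thread; the exact wiring into my script may have differed.)

```python
import tensorflow as tf

# Limit both TensorFlow thread pools to a single thread.
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 1
config.inter_op_parallelism_threads = 1
```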
But unfortunately, that didn't help. In my opinion, the main problem is that we are not able to control the number of threads at all: although we set it to 1 with various TF options, you can see that this job is creating many more threads on the node:
```
slurm_script─┬─python───128*[{python}]
             └─python───8*[{python}]
```
The training script is creating 128 threads and the evaluation script 8 (both numbers vary over time).
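A quick way to watch the thread count from inside the process itself is to read /proc (a Linux-only sketch; where you log it is just an illustration):

```python
import re

def count_os_threads():
    """Return the number of OS threads owned by this process (Linux only)."""
    with open('/proc/self/status') as status_file:
        match = re.search(r'^Threads:\s+(\d+)', status_file.read(), re.MULTILINE)
    return int(match.group(1)) if match else -1

# Log this periodically (e.g. around session creation or checkpoint restore)
# to see when the thread count approaches the SLURM limit of 28.
print('OS threads in use: %d' % count_os_threads())
```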
Any idea on how to control the number of threads would be highly appreciated, because I do need to fix this issue urgently.
Ellie
P.S. I'm using Python 2.7.13 and TensorFlow 1.3.0.