Hey, I don't really know the answer to your questions, but do you know what is creating these threads? Slim is just a wrapper on TF, so it shouldn't be creating that many threads. My best guess is that it is the data-loading queue runners, so I'd check that first.
Closing this; let me know if you find that the problem is in this tutorial and I should reopen it.
For TensorFlow code, when you add

```python
FLAGS.num_preprocessing_threads = 1
config = tf.ConfigProto()
config.intra_op_parallelism_threads = FLAGS.num_preprocessing_threads
config.inter_op_parallelism_threads = FLAGS.num_preprocessing_threads
```

you can control the number of threads. But in TF-Slim code it doesn't work.
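The config built above never reaches the session that TF-Slim creates internally unless the training loop forwards it. A minimal sketch of forwarding it, assuming the installed tf.contrib.slim version exposes the `session_config` keyword on `slim.learning.train`; the tiny model, logdir, and step count are placeholders, not anything from this issue:

```python
import tensorflow as tf

slim = tf.contrib.slim

# Cap both TensorFlow thread pools at a single thread.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)

# Tiny placeholder model so the sketch is self-contained.
inputs = tf.random_normal([4, 3])
predictions = slim.fully_connected(inputs, 1, activation_fn=None)
loss = tf.reduce_mean(tf.square(predictions))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_op = slim.learning.create_train_op(loss, optimizer)

# TF-Slim creates its own session, so the ConfigProto has to be handed to the
# training loop itself rather than to a tf.Session you construct yourself.
slim.learning.train(
    train_op,
    logdir='/tmp/train_logs',
    number_of_steps=10,
    session_config=config)
```

If `slim.evaluation.evaluation_loop` accepts the same `session_config` keyword in your version, the same config can be passed to the evaluation side as well.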
Hi @mnuke,
Do we have any options to control the number of threads in TF-Slim, in both the training and evaluation processes?
Specifically, I use this network for my classification problem. I changed the evaluation part so that training and evaluation run in parallel, like in your code. I can run it on my own CPU without any problem, but I can't execute it on a supercomputer. It seems to be related to the very large number of threads being created by TensorFlow: if the number of threads exceeds the maximum pre-set in SLURM (= 28), the job fails; since it is unable to create new threads, it ends up with the error "resource temporarily unavailable".
This error occurs when the code tries to restore parameters from checkpoints. If there is no limitation on the number of threads (like on my PC), it works fine:
However, when there is a limit on the number of threads (as with SLURM job submission on supercomputers), we get:
I tried to limit the number of CPU threads used by TensorFlow to 1 by creating a config like this:
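(A sketch of what I tried, reconstructed from the snippet quoted earlier in the thread; the exact wiring into my script may have differed.)

```python
import tensorflow as tf

# Limit both TensorFlow thread pools to a single thread.
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 1
config.inter_op_parallelism_threads = 1
```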
But unfortunately, that didn't help. In my opinion, the main problem is that we are not able to control the number of threads at all: although we set it to 1 with various TF options, you can see that this job is creating many more threads on the node:
```
slurm_script─┬─python───128*[{python}]
             └─python───8*[{python}]
```
The training script is creating 128 threads and the evaluation script 8 (both numbers vary over time).
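A quick way to watch the thread count from inside the process itself is to read /proc (a Linux-only sketch; where you log it is just an illustration):

```python
import re

def count_os_threads():
    """Return the number of OS threads owned by this process (Linux only)."""
    with open('/proc/self/status') as status_file:
        match = re.search(r'^Threads:\s+(\d+)', status_file.read(), re.MULTILINE)
    return int(match.group(1)) if match else -1

# Log this periodically (e.g. around session creation or checkpoint restore)
# to see when the thread count approaches the SLURM limit of 28.
print('OS threads in use: %d' % count_os_threads())
```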
Any idea on how to control the number of threads would be highly appreciated, because I do need to fix this issue urgently.
Ellie
P.S. I'm using Python 2.7.13 and TensorFlow 1.3.0.