irnwritshin opened this issue 7 years ago
This is not a solution, but it may help you on the way to finding one. This is the error you're getting, and I think I've seen something similar before.
Cannot assign a device for operation 'tower_1/mask_rcnn/strided_slice_13': Could not satisfy explicit device specification '/device:GPU:1' because no supported kernel for GPU devices is available
I was trying to set up TensorFlow debugging and needed to change a setting in the TF session. By default, TF tries to put OPs on the device you specify, but if it can't, it falls back to the CPU. The change I made (sorry, I don't remember it now) caused TF to strictly enforce device placement. Some OPs don't have a GPU implementation, so they can't be placed on the GPU.
Either find out what you changed that caused TF to enforce device placement, or manually find all the OPs that don't have a GPU implementation and wrap them with tf.device() to explicitly force them onto the CPU. A minimal sketch of both workarounds follows.
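A minimal sketch of the two workarounds, assuming the TF 1.x API used in this project; the tensor and op below are placeholders, not taken from the Mask R-CNN code.

```python
import tensorflow as tf

# Option 1: re-enable soft placement so TF can fall back to the CPU
# for any op that has no GPU kernel.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)  # logs where each op lands
sess = tf.Session(config=config)

# Option 2: pin the offending ops to the CPU explicitly.
x = tf.placeholder(tf.int32, shape=[None, 10])
with tf.device('/cpu:0'):
    sliced = x[:, 1:-1]  # stands in for an op that lacks a GPU kernel
```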
@waleedka Thank you so much for the reply, it's actually very helpful!
So, I've located the problem: the strided_slice is used when I manually configure TensorFlow to limit GPU memory usage. If I disable that config, there is no bug. The problem is half solved!
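For reference, a hedged guess at the kind of memory-limiting session config described above (TF 1.x / Keras 2.0.x); the fraction value is a placeholder. Building the session yourself also means you control allow_soft_placement, which is what decides whether strict placement fails on ops like strided_slice.

```python
import tensorflow as tf
import keras.backend as K

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap per-process GPU memory
config.gpu_options.allow_growth = True                     # allocate memory lazily
config.allow_soft_placement = True                         # keep the CPU fallback enabled
K.set_session(tf.Session(config=config))
```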
However, I never ran into this problem when running on one GPU, which led me to check whether some incompatibility comes from the wrapping done in parallel_model.py. So far I haven't found anything.
Another weird thing I can't get my head around is that the problem seems to only occur on the second device (GPU:1). Any thoughts?
When you use one GPU, the parallel model is not called, which means tf.device() is not used, and therefore OP placement follows the default setting (put on GPU if available, otherwise put on CPU). Once you use more than one GPU, the parallel model is used and it directs TF to put OPs on specific GPUs.
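A rough illustration of why placement changes with multiple GPUs, not the project's exact code: the parallel model builds one "tower" per GPU and pins its ops there (the build_model_fn callback is hypothetical).

```python
import tensorflow as tf

def build_towers(build_model_fn, gpu_count):
    outputs = []
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            # Every op created here gets an explicit GPU placement; any op
            # without a GPU kernel then fails unless soft placement is allowed.
            outputs.append(build_model_fn())
    return outputs
```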
@waleedka Thank you for the reply, I'll post the solution here if I find one later on.
Hi, excellent project, I had fun reading your code.
I did some experiments by reconfiguring the code and the pretrained COCO weights to adapt them to Python 2.7 and a 2-class problem. I also disabled multiprocessing as in #13 because I don't have root access to the machine's shared memory.
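A hedged sketch of the kind of reconfiguration mentioned above, subclassing the repo's Config for a 2-class (background + 1 foreground) problem; the attribute names follow the repo's config.py, but the values here are assumptions, not my actual settings.

```python
from config import Config

class TwoClassConfig(Config):
    NAME = "two_class"
    NUM_CLASSES = 1 + 1   # background + 1 foreground class
    GPU_COUNT = 2         # multi-GPU training, as described below
    IMAGES_PER_GPU = 1
```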
It worked fine when I was training on one GPU, but when I tried to run on multiple GPUs I ran into a problem like this:
This exception seems to be raised while the COCO weights are being loaded. However, parallel_model.py runs perfectly with its own test code.
I've searched a lot but can't solve this one. Has anyone run into a similar problem?
I'm running Keras (2.0.8) and TensorFlow (1.4.0).