Closed: ryan-summers closed this issue 6 years ago
@ryan-summers I'm not exactly sure how we can help, as it's not clear whether the issue is from running on the Jetson TX2 or from object_detection or from something else, and I'm not sure who could replicate your bug here.
I'll ask @jch1 and @tombstone if they're aware of anything relevant, but I think we need help from the wider TF community on this one.
@ryan-summers I believe this is due to the Jetson running out of memory. I had a similar problem when testing SSD Inception-v2-resnet and all versions of Faster R-CNN. Try to use MobileNet with SSD. That definitely works.
@oneTimePad I'm inclined to believe that you're correct. I'll try to run some more tests to see if MobileNet generates the same errors, but the initial work I have done indicates MobileNet will run successfully.
As a follow-up, I continued digging into the problem and noticed that the error never occurs when running TensorFlow in CPU-only mode (i.e. setting CUDA_VISIBLE_DEVICES=-1).
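For anyone else testing this, a minimal sketch of forcing CPU-only execution from inside a script (the variable must be set before TensorFlow is first imported for it to take effect):

```python
import os

# Hiding all CUDA devices forces TensorFlow to fall back to CPU execution.
# This must happen before TensorFlow is first imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf  # imported only after the variable is set
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "-1"
```

Equivalently, `CUDA_VISIBLE_DEVICES=-1 python infer_image.py` sets the variable from the shell for a single run.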
I know that the Jetson TX2 is slightly unusual in that the GPU and CPU share memory, but does anyone know of a way to dedicate a specific amount of memory to the TensorFlow session? Monitoring memory usage on the Jetson TX2 when SSD Inception spins up properly doesn't indicate that a majority of the memory is in use. It also appears to be a problem only at initialization: once the inference graph is loaded and the first inference completes, the graph performs inferences consistently.
@ryan-summers

Fraction of memory:

```python
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpu_options)
```

Dynamic growth:

```python
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
```
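Putting the two options together, a minimal sketch of creating a session with either limit applied (assuming the TF 1.x session API used elsewhere in this thread; the 0.5 fraction is just an example value):

```python
import tensorflow as tf  # TF 1.x API, as used in this thread

# Option 1: cap this process at a fixed fraction of total GPU memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
config = tf.ConfigProto(gpu_options=gpu_options)

# Option 2: start with a small allocation and grow on demand instead.
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True

# Pass the config when the session is created; it cannot be changed later.
with tf.Session(config=config) as sess:
    pass  # build and run the inference graph here
```

On a shared-memory device like the TX2 the fixed fraction is the more predictable option, since `allow_growth` can still expand until it collides with memory the OS and other processes are using.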
Thanks @oneTimePad - I implemented a static GPU memory fraction of 0.5, and initial tests show that the error disappears. I am now strongly inclined to believe the error is related to GPU memory usage. It would also explain the sporadic nature of the error: as the Linux system uses more memory for other processes, less of the shared CPU/GPU memory is available on the Jetson, causing the errors. I'm going to continue with some regression testing today and will close this issue if the error doesn't return within the next few hours.
Thanks for all of the help!
I'm going to close this issue, as it appears to have been resolved by limiting the amount of GPU memory TensorFlow allocates on the Jetson.
System information
Describe the problem
Sporadically, image inference fails on the Jetson TX2 using a custom-trained SSD Inception-v2 network. The network functions normally on desktop machines with identical versions of TensorFlow, CUDA, and cuDNN, in both GPU- and CPU-accelerated environments, but on the Jetson it often begins throwing the following error. The network has worked a number of times in the past, but begins throwing CUDA errors sporadically.
Source code / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.
Below is the `infer_image.py` script used to reproduce the problem. I can also provide the pre-trained network and labels file if necessary.
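For context, a hypothetical minimal sketch of what an inference script like `infer_image.py` typically looks like for an exported object_detection model; the graph path, tensor names, and input shape below are assumptions based on the standard `object_detection` frozen-graph export format, not the actual script:

```python
import numpy as np
import tensorflow as tf  # TF 1.x API

GRAPH_PATH = "frozen_inference_graph.pb"  # assumed export path

# Load the frozen inference graph.
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(GRAPH_PATH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

# Limit GPU memory use, per the workaround found in this thread.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5

with tf.Session(graph=graph, config=config) as sess:
    # Placeholder input; a real script would load and resize an image here.
    image = np.zeros((1, 300, 300, 3), dtype=np.uint8)
    boxes, scores, classes = sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": image})
```

The `image_tensor:0` / `detection_*:0` names are the conventional ones produced by the object_detection export script, but they should be verified against the actual exported graph.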