Closed MyVanitar closed 4 years ago
In the image below it is clear that the GPU memory has filled and the clock has increased, but there is no load on the cores. The testing software is correct: I have verified it in many other tests. Besides, the 100% CPU load is further evidence.
Seems like the GPU is being used. How are you loading your input data? Could be that that part works intensively on the CPU.
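One rough way to check the input-loading hypothesis is to time the data loading and the training step separately. The sketch below is not part of the Object Detection API: `load_batch` and `train_step` are hypothetical stand-ins for whatever your own script calls, and it assumes loading happens in the same thread as training (with queue-based input, the wait would show up inside the step instead).

```python
import time

def profile_steps(load_batch, train_step, num_steps=20):
    """Return (seconds spent loading data, seconds spent in train_step)."""
    load_s = 0.0
    step_s = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = load_batch()   # hypothetical: read/decode/augment on the CPU
        t1 = time.perf_counter()
        train_step(batch)      # hypothetical: one blocking training step
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s

def input_bound_fraction(load_s, step_s):
    """Fraction of wall time spent loading input (0.0 if nothing was timed)."""
    total = load_s + step_s
    return load_s / total if total > 0 else 0.0
```

If the fraction is close to 1, the GPU is probably waiting on the CPU for data most of the time, which would match a busy CPU next to an idle GPU.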
I just followed the procedure. I have created train.record and val.record, along with a config file, a label_map.pbtxt, and pretrained weights.
Then I run training in the console. As you can see, it detects the GPU, but the load goes to the CPU. I have tested it with both tensorflow-gpu 1.5 and 1.6.rc0.
I don't know whether it is related, but I trained for 800 steps and the loss still hovers between 2 and 3.
@VanitarNordic are you convinced that this is a bug, and not just some sort of configuration thing? Would you please fill out the usual platform/configuration/reproducibility part of the standard report?
Please provide details about what platform you are using (operating system, architecture). Also include your TensorFlow version. Also, did you compile from source or install a binary? Make sure you also include the exact command, if possible, to produce the output included in your test case. If you are unclear what to include, see the template displayed when you open a new GitHub issue.
We ask for this in the issue submission template, because it is really difficult to help without that information. Thanks!
@cy89
Most likely it is a bug, because I have tested many things and I followed the standard procedure. During my Google search I saw that some other people had also reported something like this. If you feel it is necessary, I'll find where it was.
Platform: Windows10-x64
CUDA: Cuda-9.0.176.1 and CuDNN-7.0.5 - GTX 1060 6G GPU
Tested with these TensorFlow versions: 1.5 and 1.6-rc0 (both show similar behavior). Installed through pip (pip install tensorflow-gpu)
Training command: (it starts training but with this behavior)
python train.py --logtostderr --train_dir=results/train --pipeline_config_path=weight/ssd_inception_v2_coco.config
in this issue some users have encountered the same problem: https://github.com/tensorflow/tensorflow/issues/12388#issuecomment-365081928
I'm getting similar behaviour to what @VanitarNordic describes:
Platform: Windows7-x64
CUDA: Cuda-9.0.176 and CuDNN-7.0.5 - GTX 650
Using: tensorflow-gpu version 1.5, installed through pip (pip install --ignore-installed --upgrade tensorflow-gpu) as per the official TensorFlow install instructions for Anaconda.
When I run mnist_test.py from https://www.tensorflow.org/tutorials/layers I get the CPU under heavy load and no change in GPU load, and when I run nvidia-smi it detects the GPU, but no processes are visible. Also, if I run sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) it successfully finds my GPU.
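For anyone who wants numbers rather than a screenshot, here is a minimal sketch that samples nvidia-smi from Python. The `--query-gpu` and `--format` flags are real nvidia-smi options; the helper names, the thresholds, and the sample values in the usage note are assumptions made up for illustration.

```python
import subprocess

# Real nvidia-smi flags; the helper names below are invented for this sketch.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def _to_int(field):
    """nvidia-smi prints text such as '[Not Supported]' for unreadable fields."""
    try:
        return int(field.strip())
    except ValueError:
        return None

def parse_gpu_csv(text):
    """Turn 'util, used, total' CSV lines into one dict per GPU."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        util, used, total = (_to_int(f) for f in line.split(","))
        rows.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return rows

def full_but_idle(row, max_util=10, min_mem_fraction=0.9):
    """True when memory is nearly full while the cores are nearly idle."""
    if None in row.values():
        return False
    return (row["util_pct"] <= max_util
            and row["mem_used_mib"] >= min_mem_fraction * row["mem_total_mib"])

def sample_gpus():
    """Run nvidia-smi once; needs the NVIDIA driver tools on PATH."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_csv(out)
```

Calling `sample_gpus()` in a loop during training and passing each row to `full_but_idle` would flag the symptom described in this thread, e.g. a hypothetical row of 3% utilization with 5900 of 6144 MiB in use.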
Augh, it seemed to be a driver issue for me: I uninstalled, re-downloaded, and re-installed my gpu drivers, restarted my computer, and it seems to be working fine now!
@adriancar
Make sure that it is really working and check it with the open-hardware-monitor software, because I have done these steps many times and it still does not work. The load is on the CPU. Also make sure that you are using the latest commit.
@VanitarNordic
After reinstalling the driver, when executing training, hardware monitor shows my GPU gets pegged at 100%, while CPU sits at ~25%.
@adriancar
I'm using the latest driver and I had cleaned everything before. Are you using the last commit?
I used a fresh tensorflow install on a fresh anaconda env. I fetched it using the instructions on tensorflow website - version 1.5.0.
@adriancar
No, by the latest commit I mean the TensorFlow object detection API repository. When did you download the repository?
I used the MNIST CNN tutorial to test if the GPU was being used: https://www.tensorflow.org/tutorials/layers
It seems like the published tensorflow-gpu wheel has been a problem for Windows users for a while now: https://github.com/tensorflow/models/issues/1942#issuecomment-316023323 It runs 10 times slower with the GPU for me vs. just using the CPU
Unfortunately I have not been able to successfully compile it for myself, as there are compilation errors https://github.com/tensorflow/tensorflow/issues/16138
I just did a clean install of the latest driver (390.77) for my Titan Xp on Windows 7, but training still uses the CPU and not the GPU.
@civilman628
Yes, judging by the evidence it is a bug.
@VanitarNordic this should probably be moved to https://github.com/tensorflow/tensorflow, no?
@CarltonSemple
I don't know, because it might happen with the object detection API only. Let @cy89 decide about it.
I've had it happen with other things.
@VanitarNordic @CarltonSemple I'm not seeing in the comment stream whether you think this problem is all computations not using the GPU, or whether it's just ssd_inception_v2_coco.
I.e., @VanitarNordic if you run @adriancar 's tutorial MNIST example, do things work as expected?
@cy89 not only ssd inception v2, but also ssd mobile v1
@cy89 The computations use the GPU inefficiently (something like blinking, but with long off-periods) and put a constant 100% load on the CPU.
Tensorflow fills my GPU RAM then does all the work on the CPU... runs slower than if I force using just the CPU, what's going on here?
@rhys-saldanha
You are not alone. I hope they fix the bug as soon as possible
@VanitarNordic I am experiencing the same behavior. I am trying to use TensorFlow's object detection API.
Platform: Ubuntu 16.04 64-bit
CUDA: Cuda-9.0.176.1 and CuDNN-7.0.5 - GTX 1060 6GB GPU
Tensorflow version: 1.5.0. Installed through pip (pip install tensorflow-gpu, also tried tensorflow-gpu==1.5.0)
Specifically, I am running SSD mobilenet using a webcam and I can see the CPU load spike to 100% while GPU utilization is at 2-5%, the temperature is up to 58-60C, and all of the available GPU memory is used. I am getting close to 0.5 FPS, which is not logical at all, considering TF is detecting my GPU. Hope this gets resolved soon.
@ParinithShekar
Hi. Yes, you are not alone. At least 3 people have reported it under this issue. I hope @cy89 considers it as soon as possible.
Any news on this? I have the same problem with tensorflow-gpu 1.6.0rc1, keras 2.1.4 and cuda 9.0 (on both Linux and Windows, single-GPU and multi-GPU).
I have a similar symptom:
While training with SSD Mobilenet (with ssd_mobilenet_v1_coco_2017_11_17 and research\object_detection\train.py), the GPU is not fully loaded; instead it jumps between 0 and 60%, forming a "comb" shaped pattern in GPU-Z
However, while trying the cifar-10 training example (tutorials\image\cifar10\cifar10_train.py), GPU usage keeps a solid/constant 90%
I then did some more experiments.
Changing image_resizer to 100x100 (from the original value of 300x300) in ssd_mobilenet_v1_coco.config yields a solid/constant GPU usage
Increasing or decreasing batch_size in the same config file changes nothing, still a "comb" shaped pattern
Since the same installation of tensorflow-gpu 1.6.0 is used in all cases, maybe there is some problem within the object detection API?
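To compare runs like these without eyeballing GPU-Z graphs, a small sketch such as the following could summarize a series of utilization samples. The function names, the 50% threshold, and the sample lists are assumptions for illustration; the lists only imitate the 0 to 60% "comb" and the constant 90% described above and are not measured data.

```python
def busy_fraction(samples, busy_threshold=50):
    """Share of samples in which utilization is at or above the threshold."""
    if not samples:
        return 0.0
    busy = sum(1 for s in samples if s >= busy_threshold)
    return busy / len(samples)

def count_bursts(samples, busy_threshold=50):
    """Number of separate busy stretches (idle-to-busy transitions)."""
    bursts = 0
    previously_busy = False
    for s in samples:
        busy = s >= busy_threshold
        if busy and not previously_busy:
            bursts += 1
        previously_busy = busy
    return bursts

# Illustrative values only, not measurements from this thread.
comb = [0, 60, 0, 0, 60, 0, 0, 60, 0]
steady = [90] * 9
```

A comb-like trace gives a low busy fraction with many bursts, while a steady trace gives a fraction near 1 with a single burst, so the two numbers make it easier to compare config changes such as the image size or batch size.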
@cy89
You don't want to consider this?
Why does my problem still exist? I tried everything mentioned above, and the GPU seemed to load successfully, but...
totalMemory: 3.95GiB freeMemory: 3.91GiB
2018-03-25 11:59:22.538312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2018-03-25 11:59:22.538353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1
2018-03-25 11:59:22.538360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0: Y N
2018-03-25 11:59:22.538364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1: N Y
2018-03-25 11:59:22.538374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0, 1
2018-03-25 11:59:23.892312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1616 MB memory) -> physical GPU (device: 0, name: Quadro K2200, pci bus id: 0000:03:00.0, compute capability: 5.0)
2018-03-25 11:59:23.913047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 1616 MB memory) -> physical GPU (device: 1, name: Quadro K2200, pci bus id: 0000:04:00.0, compute capability: 5.0)
2018-03-25 11:59:24.786177: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-03-25 11:59:24.787125: E tensorflow/stream_executor/cuda/cuda_dnn.cc:393] possibly insufficient driver version: 384.81.0
2018-03-25 11:59:24.787164: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
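In a log like the one above, the lines marked E and F are the ones that matter: cuDNN could not create a handle, and the log itself suggests a possibly insufficient driver version (384.81.0). As a minimal sketch for pulling such lines out of a long training log, assuming the line layout shown above (the regular expression and the function name are my own, not a TensorFlow API):

```python
import re

# Matches lines laid out like the log above:
# <date> <time>: <level> <source file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)

def find_problems(log_text):
    """Return (level, source, message) for every error (E) or fatal (F) line."""
    problems = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and match.group("level") in ("E", "F"):
            problems.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return problems
```

Running `find_problems` on a saved copy of the console output lists only the error and fatal lines, which is usually the first thing worth posting when asking for help.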
I had the same experience !
Platform: Windows 7 64bit
CUDA: Cuda-9.0.176 and CuDNN-7.0.5 - GTX 1060 6GB GPU
Tensorflow version: 1.8.0. Installed through pip
When running train.py with the ssd_mobilenet_v1 config, the GPU memory filled almost to the maximum (almost 6GB), but the GPU itself was not used.
Sometimes it went up to 90% usage, but most of the time it showed 0%,
while CPU usage was always 100%.
Guys, has anyone got a solution for this problem? I am facing the same problem with TF 1.8, CUDA 9.0 and CuDNN 7.1. I am trying to train a dynamic RNN model. With the CPU, the epoch duration is 2718.3 seconds; with the GPU, the same model takes 7344.9 seconds. My system configuration is as below:
Laptop: Microsoft Surface Book Pro2
RAM: 8GB
GPU: NVIDIA GeForce GTX 965M
CPU: Intel i7 6600U Quad core
I'm also facing the same problem. I have installed tensorflow-gpu 1.8 and validated that the installation uses the GPU, but when I run my Python code it says it is using TensorFlow as the backend, yet the CPU is still at 100% and the GPU is not used.
Have you guys found the solution ?
I opened this issue a long time ago, and others are also reporting bugs for free, but they don't consider them at all
Well issue is resolved for me. Here is what I did.
@Sri06006
That's a basic step when installing GPU-related packages. The majority of us have installed them correctly, and TensorFlow already shows the GPU.
For me, the problem is not with GPU utilization. I can see that the GPU is being utilized, but the training takes more time when I am using the GPU. It is faster on the CPU.
I have also come across this (frustrating) issue.. using GTX 1080 ti and can detect GPU no problem in tensorflow. Have reinstalled drivers several times but still no joy.
Last week when I ran it on the CPU-only version it ran perfectly fine for me (albeit slowly), but now the CPU usage flies up to 100% before the whole PC freezes, even before the training queues start. A part of me is quite relieved to see that it's not just me it's happening to and that it very well could be a bug. Hopefully there is a fix available soon.
Same issue is happening to me on Tensorflow-GPU 1.9.0. Lots of CPU usage but only around 2.5% GPU usage.
EDIT: Forgot to mention I was using this repo
I got it working by using the following SSD_mobilenet at https://github.com/tensorflow/models/blob/master/object_detection/samples/configs/ssd_mobilenet_v1_pets.config
Also found at: https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/
Hope it helps someone! :)
@kcobrien: Can you please explain a bit how you made the training faster with the GPU? The link "https://github.com/tensorflow/models/blob/master/object_detection/samples/configs/ssd_mobilenet_v1_pets.config" is not working for me.
@nitinpapadkar Try to get the ssd_mobilenet from this link instead: https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/
I am using tensorflow 1.8 and CUDA 9.0 and I made sure to test that my GPU is responsive in tensorflow.
Also worth noting that when my model is training my CPU usage stays between 70-99% (I assume because I am loading images into the model) but it is clearly using my GPU to train and the PC no longer freezes.
I am not sure if there is still a bug with tensorflow 1.8 and cuda 9.0, but it seems that perhaps it's more to do with the model versions being used? I could be wrong though. Worth training with the model at the link above and seeing if that makes any difference :)
Thanks @kcobrien. However, it still does not solve my problem. Checking now with a more complex model.
I have the same problem with a GTX 1070 tensorflow 1.10 and cuda 9.0
@wildpig22 @sheucm @moshebitan I have the same problem. Did you solve it?
Duplicate issue! I found an older issue here. Can we close this and continue the conversation there?
@Prakash19921206
There is no solution for this weird problem in your mentioned thread either. Therefore it makes no difference.
I installed Ubuntu 18.04. It's training much faster there!
same issue...
Hi,
I have installed tensorflow-gpu 1.5 or 1.6.rc0 together with Cuda-9.0 and CuDNN-7.0.5. When I start training using train.py, it detects the GPU, but it starts the training on the CPU and the CPU load is 100%. The GPU memory gets filled and its core clock increases, but it does not show any consistent load on the cores.