tensorflow / models

Models and examples built with TensorFlow

GPU is detected but training starts on the CPU #3366

Closed MyVanitar closed 4 years ago

MyVanitar commented 6 years ago

Hi,

I have installed tensorflow-gpu 1.5 (and also tried 1.6.rc0) along with Cuda-9.0 and CuDNN-7.0.5. When I start training using train.py, it detects the GPU, but the training runs on the CPU and the CPU load is 100%. The GPU memory gets filled and its core clock increases, but it does not show any consistent load on the cores.

name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.759
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2018-02-12 11:23:48.533753: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1308] Adding visible gpu devices: 0
2018-02-12 11:23:57.838951: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4742 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)

MyVanitar commented 6 years ago

In the image below it is clear that the GPU memory has filled and the clock has increased, but there is no load on the cores. The monitoring software is reliable; I have checked it in many other tests and its readings are correct. Besides, the 100% CPU load is further evidence.

[image: pr]

monomon commented 6 years ago

Seems like the GPU is being used. How are you loading your input data? Could be that that part works intensively on the CPU.

MyVanitar commented 6 years ago

I just followed the procedure. I created train.record and val.record, along with a config file, a label_map.pbtxt and pretrained weights. Then I ran the training in the console. As you can see, it detects the GPU, but the load goes to the CPU. I have tested it with both tensorflow-gpu 1.5 and 1.6.rc0.

I don't know whether it is related or not, but I trained up to 800 steps and the loss still fluctuates between 2 and 3.

cy89 commented 6 years ago

@VanitarNordic are you convinced that this is a bug, and not just some sort of configuration thing? Would you please fill out the usual platform/configuration/reproducibility part of the standard report?

Please provide details about what platform you are using (operating system, architecture). Also include your TensorFlow version. Also, did you compile from source or install a binary? Make sure you also include the exact command if possible to produce the output included in your test case. If you are unclear what to include, see the template displayed when you open a new GitHub issue.

We ask for this in the issue submission template, because it is really difficult to help without that information. Thanks!

MyVanitar commented 6 years ago

@cy89

Most likely it is a bug, because I have tested many things and I followed the standard procedure. During my Google search I saw that some other people had also reported something like this. If you feel it is necessary, I'll find where it was.

Platform: Windows10-x64
CUDA: Cuda-9.0.176.1 and CuDNN-7.0.5
GPU: GTX 1060 6G
TensorFlow versions tested: 1.5 and 1.6-rc0 (both show similar behavior)
Installed through pip (pip install tensorflow-gpu)

Training command: (it starts training but with this behavior)

python train.py --logtostderr --train_dir=results/train --pipeline_config_path=weight/ssd_inception_v2_coco.config

MyVanitar commented 6 years ago

in this issue some users have encountered the same problem: https://github.com/tensorflow/tensorflow/issues/12388#issuecomment-365081928

adriancar commented 6 years ago

I'm getting similar behaviour to what @VanitarNordic describes:

Platform: Windows7-x64
CUDA: Cuda-9.0.176 and CuDNN-7.0.5
GPU: GTX 650
TensorFlow: tensorflow-gpu 1.5, installed through pip (pip install --ignore-installed --upgrade tensorflow-gpu) as per the official TensorFlow install instructions for Anaconda.

When I run mnist_test.py from https://www.tensorflow.org/tutorials/layers, I get the CPU under heavy load and no change in GPU load, and when I run nvidia-smi it detects the GPU but no processes are visible. Also, if I run sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)), it successfully finds my GPU.

adriancar commented 6 years ago

Augh, it seemed to be a driver issue for me: I uninstalled, re-downloaded, and re-installed my gpu drivers, restarted my computer, and it seems to be working fine now!

MyVanitar commented 6 years ago

@adriancar

Make sure that it is really working by checking the load with the Open Hardware Monitor software, because I have done these steps many times and it still does not work for me. The load is on the CPU. Also make sure that you are using the latest commit.

adriancar commented 6 years ago

@VanitarNordic

After reinstalling the driver, when executing training, hardware monitor shows my GPU gets pegged at 100%, while CPU sits at ~25%.

MyVanitar commented 6 years ago

@adriancar

I'm using the latest driver and I had cleaned everything before installing it. Are you using the latest commit?

adriancar commented 6 years ago

I used a fresh tensorflow install on a fresh anaconda env. I fetched it using the instructions on tensorflow website - version 1.5.0.

MyVanitar commented 6 years ago

@adriancar

No, what I mean by the latest commit is the TensorFlow detection API repository. When did you download the repository that you are using?

adriancar commented 6 years ago

I used the MNIST CNN tutorial to test if the GPU was being used: https://www.tensorflow.org/tutorials/layers

CarltonSemple commented 6 years ago

It seems like the published tensorflow-gpu wheel has been a problem for Windows users for a while now: https://github.com/tensorflow/models/issues/1942#issuecomment-316023323. For me it runs 10 times slower with the GPU than with just the CPU.

Unfortunately I have not been able to compile it myself, as there are compilation errors: https://github.com/tensorflow/tensorflow/issues/16138

civilman628 commented 6 years ago

I just did a clean install of the latest driver (390.77) for my Titan Xp on Windows 7, but training still uses the CPU and not the GPU.

MyVanitar commented 6 years ago

@civilman628

Yes, judging by the evidence, it is a bug.

CarltonSemple commented 6 years ago

@VanitarNordic this should probably be moved to https://github.com/tensorflow/tensorflow, no?

MyVanitar commented 6 years ago

@CarltonSemple

I don't know, because it might happen with the object detection API only. Let @cy89 decide about it.

CarltonSemple commented 6 years ago

I've had it happen with other things.

cy89 commented 6 years ago

@VanitarNordic @CarltonSemple I'm not seeing in the comment stream whether you think this problem is all computations not using the GPU, or whether it's just ssd_inception_v2_coco.

I.e., @VanitarNordic if you run @adriancar 's tutorial MNIST example, do things work as expected?

civilman628 commented 6 years ago

@cy89 not only ssd inception v2, but also ssd mobile v1

MyVanitar commented 6 years ago

@cy89 The computations use the GPU inefficiently (something like blinking, but with long off-periods) and push a constant 100% load onto the CPU.

rhys-saldanha commented 6 years ago

Tensorflow fills my GPU RAM then does all the work on the CPU... runs slower than if I force using just the CPU, what's going on here?

MyVanitar commented 6 years ago

@rhys-saldanha

You are not alone. I hope they fix the bug as soon as possible

parinithshekar commented 6 years ago

@VanitarNordic I am experiencing the same behavior. I am trying to use TensorFlow's object detection API.

Platform: Ubuntu 16.04 64-bit
CUDA: Cuda-9.0.176.1 and CuDNN-7.0.5
GPU: GTX 1060 6GB
TensorFlow version: 1.5.0, installed through pip (pip install tensorflow-gpu; also tried tensorflow-gpu==1.5.0)

Specifically, I am running SSD MobileNet on a webcam feed, and I can see the CPU load spike to 100% while GPU utilization stays at 2-5%, the temperature is up to 58-60C, and all of the available GPU memory is used. I am getting close to 0.5 FPS, which makes no sense considering TF is detecting my GPU. Hope this gets resolved soon.
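
One way to narrow down a report like this is to time each stage of the per-frame loop and see whether the time goes to CPU-side work (frame capture, preprocessing) or to the detection call itself. The following is a minimal sketch using only the Python standard library. `StageTimer` and the stage names are hypothetical and not part of TensorFlow or the object detection API, and the functions named in the usage comment are placeholders for whatever the actual script calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a loop."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        # Time everything executed inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}


# Hypothetical usage inside a webcam loop (read_frame and run_detection
# stand in for the script's own capture and inference code):
#
#     timer = StageTimer()
#     with timer.stage("capture"):
#         frame = read_frame()
#     with timer.stage("inference"):
#         boxes = run_detection(frame)
#     print(timer.shares())
```

If most of the time lands in the capture or preprocessing stage, a low GPU utilization figure would be expected even with a correctly working GPU build. If it lands in the inference stage, the problem is more likely in the session or graph setup.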

MyVanitar commented 6 years ago

@ParinithShekar

Hi. Yes, you are not alone. At least 3 people have reported it under this issue. I hope @cy89 considers it as soon as possible.

pianas commented 6 years ago

Any news on this? I have the same problem with tensorflow-gpu 1.6.0rc1, keras 2.1.4 and cuda 9.0 (on both Linux and Windows, single-GPU and multi-GPU).

wildpig22 commented 6 years ago

I have a similar symptom:

While training with SSD Mobilenet (with ssd_mobilenet_v1_coco_2017_11_17 and research\object_detection\train.py), the GPU is not fully loaded; instead its usage jumps between 0 and 60%, forming a "comb" shaped pattern in GPU-Z.

However, while trying the cifar-10 training example (tutorials\image\cifar10\cifar10_train.py), GPU usage keeps a solid/constant 90%.

I then did some more experiments.

Changing image_resizer to 100x100 (from the original value of 300x300) in ssd_mobilenet_v1_coco.config yields a solid/constant GPU usage.

Increasing or decreasing batch_size in the same config file changes nothing; there is still a "comb" shaped pattern.

Since the same installation of tensorflow-gpu 1.6.0 is used in all cases, maybe there is some problem within the object detection API?
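
The "comb" pattern described above can also be quantified outside of GPU-Z by sampling utilization at a fixed interval. The following is a rough sketch, assuming `nvidia-smi` is on the PATH and that the driver reports `utilization.gpu` for the card (some driver and OS combinations may not). The helper names and the 10% idle threshold are arbitrary choices for illustration, and only the first GPU is read.

```python
import subprocess
import time


def parse_utilization(line):
    """Parse one nvidia-smi CSV value such as '37' or '37 %' into an int."""
    return int(line.strip().rstrip("%").strip())


def sample_gpu_utilization(samples=20, interval=0.5):
    """Poll nvidia-smi and return a list of utilization percentages."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            universal_newlines=True,
        )
        # One line per GPU; only the first GPU is considered here.
        readings.append(parse_utilization(out.splitlines()[0]))
        time.sleep(interval)
    return readings


def idle_fraction(readings, threshold=10):
    """Fraction of samples in which utilization was below `threshold` percent."""
    if not readings:
        return 0.0
    return sum(r < threshold for r in readings) / len(readings)
```

Running `idle_fraction(sample_gpu_utilization())` while train.py is active gives a single number to compare between configurations. A high idle fraction together with a pegged CPU would be consistent with the GPU waiting on CPU-side input processing, which fits the observation that a smaller image_resizer smooths out the usage, but it does not prove it.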

MyVanitar commented 6 years ago

@cy89

You don't want to consider this?

hbb21st commented 6 years ago

Why does my problem still exist? I tried all the ways mentioned above, and the GPU seemed to load successfully, but...

totalMemory: 3.95GiB freeMemory: 3.91GiB
2018-03-25 11:59:22.538312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2018-03-25 11:59:22.538353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1
2018-03-25 11:59:22.538360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0: Y N
2018-03-25 11:59:22.538364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1: N Y
2018-03-25 11:59:22.538374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0, 1
2018-03-25 11:59:23.892312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1616 MB memory) -> physical GPU (device: 0, name: Quadro K2200, pci bus id: 0000:03:00.0, compute capability: 5.0)
2018-03-25 11:59:23.913047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 1616 MB memory) -> physical GPU (device: 1, name: Quadro K2200, pci bus id: 0000:04:00.0, compute capability: 5.0)
2018-03-25 11:59:24.786177: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-03-25 11:59:24.787125: E tensorflow/stream_executor/cuda/cuda_dnn.cc:393] possibly insufficient driver version: 384.81.0
2018-03-25 11:59:24.787164: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
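
For comparing logs like the one above across machines, the device-creation lines can be pulled out with a small regular expression. This is only a sketch: the pattern is written against the two log excerpts quoted in this thread and may not match the output of other TensorFlow versions, and `created_gpu_devices` is a hypothetical helper name.

```python
import re

# Matches the "Creating TensorFlow device (...) -> physical GPU (...)" lines
# as they appear in the logs quoted in this thread.
DEVICE_RE = re.compile(
    r"Creating TensorFlow device \((?P<tf_device>\S+) "
    r"with (?P<memory_mb>\d+) MB memory\) -> physical GPU "
    r"\(device: (?P<index>\d+), name: (?P<name>[^,]+), "
    r"pci bus id: (?P<pci_bus_id>[^,]+), "
    r"compute capability: (?P<compute_capability>[\d.]+)\)"
)


def created_gpu_devices(log_text):
    """Return one dict per GPU device that the log says TensorFlow created."""
    devices = []
    for match in DEVICE_RE.finditer(log_text):
        info = match.groupdict()
        info["memory_mb"] = int(info["memory_mb"])
        info["index"] = int(info["index"])
        devices.append(info)
    return devices
```

Applied to the log above, this would report two Quadro K2200 devices with 1616 MB each. It only confirms that the devices were created; the later cuDNN handle error still has to be read from the `E` and `F` lines.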

sheucm commented 6 years ago

I had the same experience!

Platform: Windows 7 64bit
CUDA: Cuda-9.0.176 and CuDNN-7.0.5
GPU: GTX 1060 6GB
TensorFlow version: 1.8.0, installed through pip

When running train.py with the ssd_mobilenet_v1 config, the GPU memory was filled almost to the maximum (almost 6GB), but the GPU itself was not used.

Sometimes it went to 90% usage, but most of the time it showed 0%.

Meanwhile CPU usage was always 100%.

nitinpapadkar commented 6 years ago

Guys, has anyone found a solution for this problem? I am facing the same problem with TF 1.8, CUDA 9.0 and cuDNN 7.1. I am trying to train a dynamic RNN model. On the CPU an epoch takes 2718.3 seconds, whereas on the GPU the same model takes 7344.9 seconds. My system configuration is as below:

Laptop: Microsoft Surface Book Pro2
RAM: 8GB
GPU: NVIDIA GeForce GTX 965M
CPU: Intel i7 6600U quad core
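
To put the two epoch durations quoted above side by side, the slowdown can be worked out directly. This is plain arithmetic on the reported numbers, not a new measurement.

```python
# Epoch durations quoted in the comment above, in seconds.
cpu_epoch_seconds = 2718.3
gpu_epoch_seconds = 7344.9

slowdown = gpu_epoch_seconds / cpu_epoch_seconds
extra_minutes = (gpu_epoch_seconds - cpu_epoch_seconds) / 60

print(f"GPU epoch takes {slowdown:.2f}x as long as the CPU epoch")  # 2.70x
print(f"That is about {extra_minutes:.0f} extra minutes per epoch")  # about 77
```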

Sri06006 commented 6 years ago

I'm also facing the same problem. I have installed tensorflow-gpu 1.8 and validated that the installation uses the GPU. When I run my Python code, it says it is using TensorFlow as the backend, but the CPU is still at 100% and the GPU is not being used.

Have you guys found a solution?

MyVanitar commented 6 years ago

I opened this issue a long time ago, and others have also been reporting bugs here for free, but they don't consider them at all.

Sri06006 commented 6 years ago

Well issue is resolved for me. Here is what I did.

  1. Check the current device by running:
       from tensorflow.python.client import device_lib
       print(device_lib.list_local_devices())
     Output: shows only the CPU.
  2. Run "pip list".
  3. There might be several versions of tensorflow installed.
  4. Uninstall all of them except tensorflow-gpu.
  5. Run "pip install --ignore-installed --upgrade tensorflow-gpu".
  6. Run again:
       from tensorflow.python.client import device_lib
       print(device_lib.list_local_devices())
     Output: should show the GPU.

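
Steps 2 and 3 above can be checked programmatically instead of reading through the "pip list" output by eye. The following is a minimal sketch using `importlib.metadata` (Python 3.8 or newer, so newer than some of the environments in this thread). The helper names are hypothetical, and the assumption that only `tensorflow` and `tensorflow-gpu` need to be checked is mine. Both packages install the same importable `tensorflow` module, which is why having both present can leave the CPU build in use.

```python
import re
from importlib import metadata  # Python 3.8+


def normalize(name):
    """Normalize a distribution name the way pip compares names (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def tensorflow_builds(installed_names):
    """Return the sorted TensorFlow build packages found in `installed_names`."""
    # Assumption: these are the only two builds of interest here.
    candidates = {"tensorflow", "tensorflow-gpu"}
    return sorted({normalize(n) for n in installed_names} & candidates)


def installed_distribution_names():
    """List the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    builds = tensorflow_builds(installed_distribution_names())
    if len(builds) > 1:
        print("Multiple TensorFlow builds installed:", ", ".join(builds))
    else:
        print("TensorFlow builds found:", builds)
```

If more than one build is reported, uninstalling the extras and reinstalling tensorflow-gpu as in steps 4 and 5 is the fix described above. As later replies note, this does not help when the GPU is already detected correctly.
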
MyVanitar commented 6 years ago

@Sri06006

That's a basic step when installing GPU-related packages. The majority of us have installed them correctly, and TensorFlow already shows the GPU.

nitinpapadkar commented 6 years ago

For me, the problem is not with GPU utilization. I can see that the GPU is being utilized, but the training takes more time when I am using the GPU. It is faster on the CPU.

kcobrien commented 6 years ago

I have also come across this (frustrating) issue. I'm using a GTX 1080 Ti and can detect the GPU with no problem in TensorFlow. I have reinstalled the drivers several times but still no joy.

Last week when I ran it on the CPU-only version it ran perfectly fine for me (albeit slowly), but now the CPU usage flies up to 100% before the whole PC freezes, even before the training queues start. A part of me is quite relieved to see that it's not just me it's happening to and that it very well could be a bug. Hopefully there is a fix available soon.

notbrian commented 6 years ago

Same issue is happening to me on Tensorflow-GPU 1.9.0. Lots of CPU usage but only around 2.5% GPU usage.

EDIT: Forgot to mention I was using this repo

kcobrien commented 6 years ago

I got it working by using the following SSD_mobilenet at https://github.com/tensorflow/models/blob/master/object_detection/samples/configs/ssd_mobilenet_v1_pets.config

Also found at: https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/

Hope it helps someone! :)

nitinpapadkar commented 6 years ago

@kcobrien: Can you please explain a bit how you made the training faster with the GPU? The link "https://github.com/tensorflow/models/blob/master/object_detection/samples/configs/ssd_mobilenet_v1_pets.config" is not working for me.

kcobrien commented 6 years ago

@nitinpapadkar Try to get the ssd_mobilenet from this link instead: https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/

I am using tensorflow 1.8 and CUDA 9.0 and I made sure to test that my GPU is responsive in tensorflow.

Also worth noting that when my model is training my CPU usage stays between 70-99% (I assume because I am loading images into the model) but it is clearly using my GPU to train and the PC no longer freezes.

I am not sure if there is still a bug with TensorFlow 1.8 and CUDA 9.0, but it seems that perhaps it's more to do with the model versions being used? I could be wrong though. It's worth training with the model at the link above and seeing if that makes any difference :)

nitinpapadkar commented 6 years ago

Thanks @kcobrien. However, it still does not solve my problem. I'm checking now with a more complex model.

moshebitan commented 6 years ago

I have the same problem with a GTX 1070, TensorFlow 1.10 and CUDA 9.0.

Aigul95 commented 5 years ago

@wildpig22 @sheucm @moshebitan I have the same problem. Did you solve it?

Prakash19921206 commented 5 years ago

Duplicate issue! I found an older issue here. Can we close this and continue the conversation there?

MyVanitar commented 5 years ago

@Prakash19921206

There is no solution for this weird problem in your mentioned thread either. Therefore it makes no difference.

Prakash19921206 commented 5 years ago

I installed Ubuntu 18.04. It's training much faster there!

xqiangx1991 commented 5 years ago

same issue...