tensorflow / models

Models and examples built with TensorFlow

Object detection using GPU on Windows is about 5 times slower than on Ubuntu #1942

Open rniebecker opened 7 years ago

rniebecker commented 7 years ago

System information

Describe the problem

Installed NVIDIA CUDA and cuDNN on both systems. Installed TensorFlow on Ubuntu and Windows with GPU support, both times using Python 3.5.2 and the native pip installation. Installed and set up the models repository as required.

My Windows machine has a GTX 1080; my Ubuntu box has a GTX 970. Comparing the results between the two setups shows that object detection on Windows is about 5 times slower than on Ubuntu. Measuring the GPU load on Windows with GPU-Z shows that the GPU is barely used while object detection is running, only jumping up to about 16% from time to time. I ran the same test with TensorFlow without GPU support on the same machine, and the result was only about 20% slower than with the GPU.

Interestingly, when running the mnist_deep.py example from site-packages/tensorflow/examples/tutorials/mnist, the GPU load on my Windows machine goes up to about 62%.

It looks like object_detection is for some reason not using the full capabilities of the GPU on Windows; the question is why.
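
As a diagnostic, here is a minimal plain-TF 1.x sketch (not the object_detection code itself) that logs which device each op is placed on, so CPU fallbacks become visible; the tiny graph below is only illustrative:

import tensorflow as tf

# Minimal TF 1.x sketch: log_device_placement prints, for every op,
# which device it was assigned to, so silent CPU placement shows up.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    print(sess.run(a + b))  # placement info is written to stderr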

Are there any additional configurations I have to perform to speed it up? Are there any NVIDIA Windows graphics driver settings I can or should use? Is anyone else having the same issue?

Cheers, Ralf

Zumbalamambo commented 7 years ago

How do you run it on Windows?

Zumbalamambo commented 7 years ago

I'm trying to run the object detection example. I ran jupyter notebook in the object_detection directory and then opened the notebook file. It raises the following error:


ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 from utils import label_map_util
      2
      3 from utils import visualization_utils as vis_util

C:\Users\Documents\models-master\models-master\object_detection\utils\label_map_util.py in <module>()
     20 import tensorflow as tf
     21 from google.protobuf import text_format
---> 22 from object_detection.protos import string_int_label_map_pb2
     23
     24
ImportError: cannot import name 'string_int_label_map_pb2'

rniebecker commented 7 years ago

Did you follow the object_detection installation instructions? You can find them here: https://github.com/tensorflow/models/blob/master/object_detection/g3doc/installation.md

You will need to download the binary protoc from Google, you can find it here: https://github.com/google/protobuf/releases

Download the protoc-3.3.0-win32.zip and store the executable somewhere in your path.

Cheers, Ralf

Zumbalamambo commented 7 years ago

Yes, but those instructions are for Linux; Windows doesn't let me use sudo apt. Do I just have to put the exe file somewhere in my path?

rniebecker commented 7 years ago

I assume you have already installed Python 3.5 (64-bit) with tensorflow-gpu as described here: https://www.tensorflow.org/install/install_windows

Then you need to download the binary protoc from Google; you can find it here: https://github.com/google/protobuf/releases. Download protoc-3.3.0-win32.zip and store the executable somewhere in your path, so that when you type protoc in the console (cmd) it tells you: Missing input file.

Go to the models directory (after cloning it) and execute: protoc object_detection/protos/*.proto --python_out=.

Add an environment variable PYTHONPATH so it looks like this: PYTHONPATH=C:\Python35\Lib\site-packages\tensorflow\models;C:\Python35\Lib\site-packages\tensorflow\models\slim. Adjust the paths so they point to your models and models\slim directories.
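
A quick, minimal way to check that the PYTHONPATH entry is picked up (open a fresh console after setting the variable; the print is only illustrative):

import object_detection

# Should print a path inside your models directory; an ImportError here
# means PYTHONPATH is not visible to this console.
print(object_detection.__file__)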

Additionally I executed: python setup.py

To test if it works open up a new console (cmd), go to the models directory and execute: python object_detection/builders/model_builder_test.py
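
If the test passes but you still suspect the GPU build isn't being used, a small sanity check (not part of the official instructions) is to list the devices TensorFlow can see:

from tensorflow.python.client import device_lib

# A working tensorflow-gpu install should list at least one device of
# type GPU (e.g. /device:GPU:0) next to the CPU.
print(device_lib.list_local_devices())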

Cheers, Ralf

rniebecker commented 7 years ago

Funny thing, I compiled tensorflow myself with GPU support on Windows using cmake as described here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake and now it's working with the same performance as on Ubuntu!

So it looks like the problem is that the windows wheels (tensorflow-gpu) distributed by Google are not correctly compiled to fully utilize the GPU on Windows... !?

Cheers, Ralf

AndrewKemendo commented 7 years ago

That is definitely the case: even if you install tensorflow-gpu, properly place TensorFlow on the GPU, and run Inception, it will still give you the "The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations" warning.

Further, protoc 3.3.0 doesn't seem to support protoc *.proto --python_out=. on Windows, as it throws errors on the wildcard, but you can compile the protos individually by passing each filename.proto.
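
For what it's worth, a hypothetical workaround sketch for the wildcard problem, run from the models directory, is to expand the glob yourself and invoke protoc once per file:

import glob
import subprocess

# cmd does not expand *.proto for protoc, so loop over the files explicitly
# (assumes protoc is on PATH and the working directory is models).
for proto in glob.glob('object_detection/protos/*.proto'):
    subprocess.check_call(['protoc', proto, '--python_out=.'])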

Really odd

rniebecker commented 7 years ago

I compiled TensorFlow with SIMD optimizations, but you need to make some code changes for it to compile on Windows. The resulting build no longer complains about SSE instructions, but there is no measurable improvement in performance that I can see.

Regarding protoc, I didn't have any issue like the one you described.

Cheers, Ralf

Coderx7 commented 7 years ago

Any update on this? I'm having the same issue here. I get Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0, and while GPU memory is filled up, the performance is extremely slow. It seems everything is running on the CPU instead of the GPU, since CPU utilization is nearly 100% (CPU and GPU usage screenshots not reproduced here).

I also noticed that the same issue exists on Ubuntu, but it is at least 4 times faster than on Windows (each step takes 400 ms, while on Windows it takes 1300 ms).
I'm using TensorFlow 1.3.0 on both Ubuntu (14.04) and Windows, and both were installed with the pip install --upgrade tensorflow-gpu command.

And here is the full log:

G:\Tensorflow_section\models-master\object_detection>python train.py  --logtostderr --train_dir=training_stuff --pipeline_config_path=ssd_mobilenet_v1_pets.config
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
WARNING:tensorflow:From C:\Users\Master\Anaconda3\envs\anaconda35\lib\site-packages\object_detection-0.1-py3.5.egg\object_detection\meta_architectures\ssd_meta_arch.py:607: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2017-09-18 03:44:08.545358: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 03:44:08.545474: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 03:44:09.121357: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:01:00.0
Total memory: 8.00GiB
Free memory: 6.63GiB
2017-09-18 03:44:09.121483: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-09-18 03:44:09.122196: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0:   Y
2017-09-18 03:44:09.133158: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from training_stuff\model.ckpt-0
2017-09-18 03:44:15.528390: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path training_stuff\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 20.1465 (18.034 sec/step)
INFO:tensorflow:global step 2: loss = 15.8647 (1.601 sec/step)
INFO:tensorflow:global step 3: loss = 13.3987 (1.540 sec/step)
INFO:tensorflow:global step 4: loss = 11.5424 (1.562 sec/step)
INFO:tensorflow:global step 5: loss = 10.8328 (1.337 sec/step)
INFO:tensorflow:global step 6: loss = 10.7179 (1.317 sec/step)
INFO:tensorflow:global step 7: loss = 9.7616 (1.369 sec/step)
INFO:tensorflow:global step 8: loss = 8.5631 (1.336 sec/step)
INFO:tensorflow:global step 9: loss = 7.2683 (1.384 sec/step)
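
For reference, one way to see where the ~1.3 s per step goes is to trace a single step and dump a Chrome trace. This is a generic TF 1.x sketch, not a drop-in change to train.py: sess and train_op below are placeholders for whatever session and train op you can get hold of.

import tensorflow as tf
from tensorflow.python.client import timeline

# Trace one step and write a Chrome trace (open it in chrome://tracing)
# so CPU-bound ops such as the input pipeline show up clearly.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())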
rs0h commented 6 years ago

@Coderx7 Have you found any resolution on the issue?

Are there any updates?

CarltonSemple commented 6 years ago

Has this improved with tensorflow-gpu 1.4?

Zumbalamambo commented 6 years ago

@CarltonSemple no

mljack commented 6 years ago

I got similar results on my Win10 machine compared with the results @rniebecker got: full load on 20 CPU cores and 15% GPU usage on a GTX 1080 Ti while training Faster R-CNN with a ResNet-101 feature extractor. Profiling shows that grad_clip consumes most of the CPU cycles.

Edit: On Ubuntu, 50% CPU usage and 80% GPU usage. Training time is cut in half (0.37 s vs 0.75 s).

blaskowitz100 commented 6 years ago

Yeah, same problem for me. 100% CPU load and only 15% on my GTX1080. The same model trains twice as fast on my Ubuntu machine with a GTX970.

civilman628 commented 6 years ago

Yes, same for me. I am on Windows 7 with a Titan Xp GPU.

CPU consumption is very high, near 100%, but GPU usage is low even though GPU memory is allocated. This is not normal, and training is very slow.

CarltonSemple commented 6 years ago

@rniebecker do you know if those code changes were ever merged into the master branch? If not, could you share with us what you did to make it compile successfully?

isohrab commented 6 years ago

Any solution? I ran the same code on Windows and Linux with the same hardware configuration; Linux is 8 times faster than Windows. Linux uses 30% CPU and 80% GPU, but Windows uses 99% CPU and only 20% GPU.

AshimaChawla commented 6 years ago

Hi @isohrab,

I switched to an NVIDIA GTX 1070, and I can see that training an LSTM model with the GPU (built using cmake) is very, very slow compared to a plain CPU setup on a Windows 10 system.

CUDA 9.0, cuDNN 7.1.4, TF 1.7.1, Keras 2.1.6, Python 3.6

Is there any credible performance difference on Ubuntu compared to Windows?

Could you please advise?

Regards, Ashima

csenw commented 6 years ago

Same here. I just ran TensorFlow on the PTB data and found Ubuntu can do 0.85 epochs per minute compared to 0.22 epochs per minute on Windows 10.

fzqneo commented 6 years ago

I also observe odd performance, though different from what is described above: neither the GPU nor the CPU is fully loaded, and SSD+MobileNetV1 runs at 10 fps on a GTX 1080. Other than that, everything else looks normal.

And I simply measured the inference time in a tight loop in the official tutorial:

import time          # added for completeness; in the tutorial notebook
import numpy as np   # numpy is already imported at the top

def run_inference_for_single_image(image, graph):
   # …
      # Run inference 100 times and report the average per-call latency
      tic = time.time()
      for _ in range(100):
        output_dict = sess.run(tensor_dict,
                               feed_dict={image_tensor: np.expand_dims(image, 0)})
      toc = time.time()
      print("sess.run time: %f" % ((toc - tic) / 100.0))

SSD+MobileNet v1 measures ~100 ms per image, while Faster R-CNN + ResNet-101 measures ~220 ms. This is really odd: MobileNet is far too slow, and Google's GitHub page shows a >3x performance gap between the two. Also, during inference neither the CPU nor the GPU is fully loaded; the GPU sits at around 15% utilization, although GPU memory is fully used. The inference results themselves look perfectly fine.
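
One caveat with timing like this: the very first sess.run also pays one-off graph and cuDNN setup cost. A fairer variant, using the same sess / tensor_dict / image_tensor names as the snippet above, warms up once before averaging:

import time
import numpy as np

# Run once outside the timed loop so one-off setup cost is excluded.
sess.run(tensor_dict, feed_dict={image_tensor: np.expand_dims(image, 0)})

tic = time.time()
for _ in range(100):
    output_dict = sess.run(tensor_dict,
                           feed_dict={image_tensor: np.expand_dims(image, 0)})
print("sess.run time: %f" % ((time.time() - tic) / 100.0))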

Prakash19921206 commented 5 years ago

Same issue on Windows (I installed using pip install tensorflow-gpu==1.5), Python 3.5, CUDA 9.0, cuDNN 7.0.

The training batch size is the default (24) [in the file ssd_mobilenet_v1_pets.config].

The CPU is at almost 100% most of the time; on a GTX 1080 Ti the GPU load goes up to 16% and then drops back to 0%.

If I alter the batch size, the training time per step changes accordingly. As per my understanding, the image resize to (300, 300) happens on the CPU, and after the resize the data is sent to the GPU. Can we make the entire thing happen on the GPU?
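
As far as I know there is no config switch for this, and the resize runs in the CPU input pipeline. Purely to illustrate the mechanism (a minimal TF 1.x sketch, not the object_detection preprocessing code), an op can be pinned to the GPU with a device scope:

import tensorflow as tf

# Minimal sketch: pin a 300x300 resize to the GPU via a device scope.
image = tf.placeholder(tf.uint8, shape=[None, None, 3], name='raw_image')
with tf.device('/gpu:0'):
    resized = tf.image.resize_images(tf.to_float(image), [300, 300])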

Thanks & Regards

OevreFlataeker commented 5 years ago

> Funny thing, I compiled tensorflow myself with GPU support on Windows using cmake as described here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake and now it's working with the same performance as on Ubuntu!
>
> So it looks like the problem is that the windows wheels (tensorflow-gpu) distributed by Google are not correctly compiled to fully utilize the GPU on Windows... !?
>
> Cheers, Ralf

@rniebecker I also just compiled TF 1.13.1 from scratch with CUDA 10.0 and the latest NVIDIA drivers in an Anaconda environment with Python 3.7, but I can't say it's really faster... I followed https://medium.com/@amsokol.com/update-2-how-to-build-and-install-tensorflow-gpu-cpu-for-windows-from-source-code-using-bazel-61c26553f7e8. The GPU is the only one in the system, so other (Windows) processes, e.g. Chrome, are running on it at the same time.

What build parameters did you use?

rlewkowicz commented 5 years ago

@jch1 Any word on this? (I only ask because it's assigned to you.) I'm still seeing about a 5x performance loss on Windows vs Linux. In particular, I'm using custom datasets trained against yolov3 using this implementation:

https://github.com/zzh8829/yolov3-tf2

On Linux, I'm running at about 7-10 ms per frame. On the same video, on Windows, I run at about 50-60 ms.

I'm not seeing the CPU utilization some of these other users are.

CUDA 10.0 on both systems, with similar NVIDIA driver versions. I can be more specific if you need, but I suspect there's something more deeply rooted than just my versions, given this bug is 2 years old.

OevreFlataeker commented 5 years ago

> @jch1 Any word on this? (I just ask because it's assigned to you) I'm still seeing about 5x performance loss on windows vs Linux. In particular I'm using custom data-sets trained against yolov3 using this implementation:

Just for my understanding: do you see this bad performance with any object detection model, or just with yolov3? I've tried a couple of them (resnet, ssd_mobilenet, ...) and none of them was on par with the Linux version, even though the Linux GPU is only half as fast as the one on Windows (750 Ti vs 970). The only difference I can spot is that on Linux the GPU is used exclusively by TF, whereas on Windows it is the default graphics adapter, so for example Chrome and other Windows processes also have some share of it.

rlewkowicz commented 5 years ago

I think we chatted about this in the other thread, but I'm only using yolov3.

This bug https://github.com/tensorflow/tensorflow/issues/29874

was assigned to achandraa so we'll see where it goes!

caocuong0306 commented 5 years ago

Hi guys, did you solve this problem? Please give me some tips to fix the issue. I'm having the same problem: the performance on Windows is 2.5x worse than on Ubuntu with the same settings (TF 1.13.1 + CUDA 10 + cuDNN 7.6). The tested GPU card is an RTX 2080 Ti.

Update 1: @rlewkowicz, could you please confirm the WDDM issue that you mentioned in https://github.com/tensorflow/tensorflow/issues/29874? If this is the case, do I need to change my GPU to a Titan series card?

Update 2: I changed my GPU card to an NVIDIA Titan Xp and set TCC mode. However, the performance on the NVIDIA Titan Xp is still ~5x slower than on Ubuntu (also with an NVIDIA Titan Xp).

Thanks.

OevreFlataeker commented 5 years ago

No news here, but it is "good" to hear that it seems to be a verified problem and not related to my specific setup/system...

But if it is related to the driver model, wouldn't that mean that ALL applications on Windows using the official drivers would be affected as well? I'd expect this to have put some pressure on NVIDIA if that were the case. Any info about related problems with different apps/frameworks?

Coderx7 commented 5 years ago

@OevreFlataeker: This is not specific to TF; I have the very same issue with PyTorch on Windows as well. Both PyTorch and TensorFlow are at least 2-3x slower than their respective counterparts on Ubuntu.

caocuong0306 commented 5 years ago

But what's the point of setting an NVIDIA GPU to TCC mode on Windows? I changed my Titan Xp to TCC and nothing changed.

Update: After reconfiguring everything, I was able to obtain reasonable performance on Windows, even though the program on Ubuntu is still 1.2x faster.

Thank you guys.

pauliver commented 5 years ago

I'm having the same problem @caocuong0306 - when you say you reconfigured everything, what exactly did you do? Remove everything and start over?

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

civilman628 commented 4 years ago

I don't believe the issue is fixed.

Zumbalamambo commented 4 years ago

I don't believe the issue is fixed.

3d-illusions commented 3 years ago

Same problem on a 1070. Installed in the Anaconda prompt:

conda create -n tf-gpu tensorflow-gpu
conda activate tf-gpu

Windows 10, conda: tf-gpu (python 3.8.5) using Spyder (if that matters)

3d-illusions commented 3 years ago

Does this issue only affect non-RTX cards?