lukeiwanski / tensorflow

OpenCL support for TensorFlow via SYCL
Apache License 2.0

AMD GPU extremely slow computing basic math #213

Closed: jacogasp closed this issue 6 years ago

jacogasp commented 6 years ago

I successfully compiled the amd_gpu branch of this repo and I'm finally able to run TensorFlow with OpenCL. The problem is that it is totally unusable because of the extremely long compute times, even for very basic operations.

Using the following code

import tensorflow as tf
import time

start_time = time.time()

a = tf.Variable(tf.truncated_normal([2, 2], seed=1))
b = tf.Variable(tf.truncated_normal([2, 2], seed=2))

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

print(tf.matmul(a, b).eval())

print("Elapsed time: ", time.time() - start_time)

it takes more than 40 seconds on a Radeon HD7950. Here is the output:

2018-02-25 22:03:55.081221: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-25 22:03:55.081274: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Tahiti, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
[[-0.85811085 -0.19662298]
 [ 0.13895045 -1.2212768 ]]
Elapsed time:  41.101563453674316

Process finished with exit code 0

while the same task takes less than 0.1 s on my laptop (CPU).

Specifying with tf.device('/CPU:0'): I obtain a more reasonable computation time of 0.17 s. Using '/SYCL:0' gives the same result as not specifying a device at all (again 40+ seconds), while using '/GPU:0' produces the following error:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Variable_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:SYCL:0 ]. Make sure the device specification refers to a valid device.
     [[Node: Variable_1 = VariableV2[container="", dtype=DT_FLOAT, shape=[2,2], shared_name="", _device="/device:GPU:0"]()]]
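For reference, a quick way to confirm which device names a given build actually exposes is to list the local devices before pinning ops to one. This is a minimal sketch using TF 1.x's device_lib, which lives in an internal module and may not be a stable API; allow_soft_placement is a standard ConfigProto option that lets TensorFlow fall back to an available device instead of raising this error.

import tensorflow as tf
from tensorflow.python.client import device_lib

# Print every device TensorFlow can see; on this build the expected names
# are '/device:CPU:0' and '/device:SYCL:0', with no '/device:GPU:0'.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)

# allow_soft_placement lets TensorFlow place an op on an available device
# instead of failing when the requested one does not exist.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(config=config)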

System information

computecpp_info

********************************************************************************

ComputeCpp Info (CE 0.6.0)

********************************************************************************

Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.

********************************************************************************

Device Info:

Discovered 1 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Tahiti
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2527.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.6.0/platform-support-notes

********************************************************************************

clinfo

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (2527.3)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  <printPlatformInfo:5: get CL_PLATFORM_HOST_TIMER_RESOLUTION : error -30>
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     Tahiti
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 AMD-APP (2527.3)
  Driver Version                                  2527.3
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Board Name (AMD)                         AMD Radeon HD 7900 Series
  Device Topology (AMD)                           PCI-E, 01:00.0
  Max compute units                               14
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             950MHz
  Graphics IP (AMD)                               6.0
  Device Partition                                (core)
    Max number of sub-devices                     14
    Supported partition types                     none specified
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64

EDIT: I'm using export GPU_FORCE_64BIT_PTR=1 to make it run.
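(For completeness, the same workaround can be attempted from inside Python, under the assumption that the variable is read when TensorFlow first initializes the OpenCL runtime; the shell export is the safer option.)

import os

# Must be set before TensorFlow first touches the OpenCL runtime.
os.environ['GPU_FORCE_64BIT_PTR'] = '1'

import tensorflow as tf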

DuncanMcBain commented 6 years ago

Hi @JacoGasp, thanks for your report. When using SYCL, the kernels must first be compiled to target the specific device you are running on. Since they are quite large, especially for matmul, this will take a while, but you will find subsequent runs much quicker.

That being said, you also need to give it a decent amount of work to do - small tasks will see basically no speedup on GPUs, if not a slowdown compared to the CPU - so make sure that there is enough work to distribute across the full hardware.

ETA: I'm not sure what the distinction between "SYCL" and "GPU" is. My best guess is that "GPU" means an NVIDIA device, but I've not looked at that closely.

jacogasp commented 6 years ago

Thanks for the reply @DuncanMcBain. Can you explain what you mean by compiling the kernels to target the device? I used this guide to compile TensorFlow.

I used the latest official AMD drivers, but I realised just now that my GPU might not be supported. This is weird, because everything seems to be working. At this point I'm not sure which driver/OpenCL version I should use; it's a nightmare out there.

Yes, I know that for small tasks the improvement is negligible, but I first tried to run the MNIST tutorial, which runs in less than a couple of minutes on my laptop, CPU only. Since with the AMD GPU it was "apparently" stuck at step zero, I tried that simple task instead and found that using the GPU is orders of magnitude slower than the CPU (41 s vs 0.17 s).

DuncanMcBain commented 6 years ago

To answer your middle question first, AMD's driver situation is... complicated. However, if you have a setup where your GPU is recognised, where clinfo (and computecpp_info) both report it and a basic test works... I'd stick with that!

To answer your other two questions, what I mean is that when you compile TensorFlow, we output some "intermediate representation" representing the computational core of TensorFlow (for example, a function that performs a matrix multiplication). It is intermediate because we don't know what system you will eventually run on (be it Intel, AMD, ARM...), so before execution these kernels must be further compiled into code that will run on your hardware. This is done by the OpenCL implementation, and particularly in the case of matmul operations this will take a while, as there is a lot of code involved in those operations.

If you try timing a second matmul, the numbers should be much more reasonable (as the results of this compilation phase are cached). When you are using TensorFlow it looks orders of magnitude slower, but the likelihood is that the compilation took about 40 of those seconds - an easy way to test this is to do the same thing over again in your Python code and measure the time taken the second time. Similarly, there will be a long warm-up time for the MNIST sample, but after that it will run without the initial delay.
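In code, the test described above might look like the following minimal sketch (TF 1.x): run the same matmul twice and time each run, so the first run absorbs the one-time kernel compilation and the second shows steady-state performance.

import tensorflow as tf
import time

a = tf.Variable(tf.truncated_normal([2, 2], seed=1))
b = tf.Variable(tf.truncated_normal([2, 2], seed=2))
product = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for run in (1, 2):
        start = time.time()
        sess.run(product)  # run 1 includes OpenCL kernel compilation
        print("Run %d: %.4f s" % (run, time.time() - start))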

jacogasp commented 6 years ago

Ok, got it. Clear explanation (I misread your first post; now it's clear).

So, assuming that my GPU is fully working, I will try to add a second matmul to test the performance.

From your answer, I also assume that there's no way to compile those intermediate files just once, is there?

For my purposes, I need to train relatively simple networks with relatively small datasets. At work, I'm using a GeForce GTX 1060 that allows me to train my networks in a couple of minutes. At home I would like to use my gaming AMD GPU for some experiments, but if with OpenCL it takes the same time or more (compared to the NVIDIA card) just to initialize TensorFlow, it's probably not worth it compared to using the CPU.

DuncanMcBain commented 6 years ago

We don't have any on-disk caching in place at the moment, though as features go it would likely be very useful.

You'd need to experiment a little to find the cutoff point - as I say, the more work you do in a single session, the better your results will be. Hopefully we can get some fixes in to make this warm-up time better, but until then there will be a start-up delay.
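One way to run that experiment is a sweep like the following minimal sketch (TF 1.x); the sizes and the '/SYCL:0' device name are assumptions to adjust for your own setup.

import tensorflow as tf
import time

# Time the same matmul at increasing sizes on each device; the size at
# which '/SYCL:0' overtakes '/CPU:0' marks the cutoff point.
for device in ('/CPU:0', '/SYCL:0'):
    for n in (256, 1024, 4096):
        tf.reset_default_graph()
        with tf.device(device):
            a = tf.Variable(tf.truncated_normal([n, n], seed=1))
            product = tf.matmul(a, a)
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            sess.run(product)  # warm-up: pays the one-time kernel compilation
            start = time.time()
            sess.run(product)
            print("%s n=%d: %.4f s" % (device, n, time.time() - start))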

jacogasp commented 6 years ago

Perfect, thank you for your help! I'll stay tuned.

jacogasp commented 6 years ago

Hey @DuncanMcBain, you're perfectly right!

From this

import tensorflow as tf
import time

start_time = time.time()
with tf.device('/SYCL:0'):
    a = tf.Variable(tf.truncated_normal([3000, 3000], seed=1))
    b = tf.Variable(tf.truncated_normal([3000, 3000], seed=2))

    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())
    print("Initialized after: ", time.time() - start_time)

    for i in range(1, 6):
        step_time = time.time()
        tf.matmul(a, b).eval()
        print("Elapsed time at step %d: %.20f " % (i, time.time() - step_time))
print("Total elapsed time: %.20f " % (time.time() - start_time))

the output with GPU is:

Initialized after:  5.496872425079346
Elapsed time at step 1: 35.02785444259643554688 
Elapsed time at step 2: 0.10464525222778320312 
Elapsed time at step 3: 0.10469603538513183594 
Elapsed time at step 4: 0.10484170913696289062 
Elapsed time at step 5: 0.10464382171630859375 
Total elapsed time: 40.94383144378662109375 

Process finished with exit code 0

while with CPU:

Initialized after:  0.325620174407959
Elapsed time at step 1: 0.36024403572082519531 
Elapsed time at step 2: 0.36671090126037597656 
Elapsed time at step 3: 0.38034081459045410156 
Elapsed time at step 4: 0.38145875930786132812 
Elapsed time at step 5: 0.38542723655700683594 
Total elapsed time: 2.20009613037109375000  

Now everything makes sense! Thanks for your time

DuncanMcBain commented 6 years ago

@JacoGasp great, I'm glad that's helped! If you've any other questions, let us know.