Hi @JacoGasp, thanks for your report. When using SYCL, the kernels must first be compiled to target the specific device you are running on. Since they are quite large, especially for matmul, this will take a while, but you will find subsequent runs much quicker.
That being said, you also need to give it a decent amount of work to do - small tasks will see basically no speedup on GPUs, if not an outright slowdown compared to the CPU - so make sure there is enough work to distribute across the full hardware.
ETA: I'm not sure what the distinction between "SYCL" and "GPU" is. My best guess is that "GPU" means an NVIDIA device, but I've not looked at that closely.
Thanks for the reply @DuncanMcBain. Can you explain what you mean by compiling the kernels to target the device? I used this guide to compile TensorFlow.
I used the latest official AMD drivers, but I realised just now that my GPU might not be supported. This is weird, because everything seems to be working. At this point I'm not sure which driver/OpenCL version I should use; it's a nightmare out there.
Yes, I know that for small tasks the improvement is negligible, but first I tried to run the MNIST tutorial, which runs in less than a couple of minutes on my laptop, CPU only. Since with the AMD GPU it was "apparently" stuck at step zero, I tried that simpler task instead and found that using the GPU is orders of magnitude slower than the CPU (41 s vs 0.17 s).
To answer your middle question first, AMD's driver situation is... complicated. However, if you have a setup where your GPU is recognised, where clinfo (and computecpp_info) both report it and a basic test works... I'd stick with that!
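If it helps, here's a rough sketch of such a basic test from Python (just a sketch; it assumes a TensorFlow 1.x build where the SYCL device is registered as '/SYCL:0'): list the devices TensorFlow can see, then run a trivial op with device placement logging enabled.

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow has registered; a working SYCL setup should
# show a SYCL device alongside the CPU.
print(device_lib.list_local_devices())

# Run a trivial op on the SYCL device and log where it actually ran.
with tf.device('/SYCL:0'):
    c = tf.add(tf.constant([1.0, 2.0]), tf.constant([3.0, 4.0]))

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))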
To answer your other two questions, what I mean is that when you compile TensorFlow, we output some "intermediate representation" of the computational core of TensorFlow (for example, a function that performs a matrix multiplication). It is intermediate because we don't know what system you will eventually run on (be it Intel, AMD, ARM...), so before execution these kernels must be further compiled into code that will run on your hardware. This is done by the OpenCL implementation, and in the case of matmul in particular it will take a while, as there is a lot of code involved in those operations.
If you try timing a second matmul, the numbers should be much more reasonable (as the results of this compilation phase are cached). When you are using TensorFlow, it looks orders of magnitude slower, but the likelihood is that the compilation took about 40 of those seconds - an easy way to test this is to do the same thing again in your Python code and measure the time taken then. Similarly, there will be a long warm-up time for the MNIST sample, but after that it will run without an initial delay.
Ok, got it. Clear explanation (I misread your first post; now it's clear).
So, assuming that my GPU is fully working, I will try to add a second matmul to test the performance.
From your answer, I also assume there's no way to compile those intermediate files only once, is there?
For my purposes, I need to train relatively simple networks with relatively small datasets. At work, I'm using a GeForce GTX 1060 that allows me to train my networks in a couple of minutes. At home I would like to use my gaming AMD GPU for some experiments, but if with OpenCL it takes the same time or more (compared to the NVIDIA card) just to initialize TensorFlow, it is probably not worth it compared to using the CPU.
We don't have any on-disk caching in place at the moment, though as features go it would likely be very useful.
You'd need to experiment a little to find the cutoff point - as I say, the more work you do in a single session, the better your results will be. Hopefully we can get some fixes in to make this warm-up time better, but until then there will be a start-up delay.
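If it's useful, here's a rough sketch of what I mean (assuming the same 3000x3000 matmul and the '/SYCL:0' device name from this thread): build the graph once, then run the same op repeatedly inside a single session, so the one-off kernel compilation is only paid on the first run.

import tensorflow as tf
import time

# Build the graph once; the SYCL kernels are compiled the first time it runs.
with tf.device('/SYCL:0'):
    a = tf.Variable(tf.truncated_normal([3000, 3000], seed=1))
    b = tf.Variable(tf.truncated_normal([3000, 3000], seed=2))
    product = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1, 11):
        start = time.time()
        sess.run(product)  # only the first run includes the compilation cost
        print("Run %d: %.4f s" % (i, time.time() - start))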
Perfect, thank you for your help! I'll stay tuned
Hey @DuncanMcBain, you're perfectly right!
From this code:
import tensorflow as tf
import time

start_time = time.time()

with tf.device('/SYCL:0'):
    a = tf.Variable(tf.truncated_normal([3000, 3000], seed=1))
    b = tf.Variable(tf.truncated_normal([3000, 3000], seed=2))

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
print("Initialized after: ", time.time() - start_time)

for i in range(1, 6):
    step_time = time.time()
    tf.matmul(a, b).eval()
    print("Elapsed time at step %d: %.20f " % (i, time.time() - step_time))

print("Total elapsed time: %.20f " % (time.time() - start_time))
the output with GPU is:
Initialized after: 5.496872425079346
Elapsed time at step 1: 35.02785444259643554688
Elapsed time at step 2: 0.10464525222778320312
Elapsed time at step 3: 0.10469603538513183594
Elapsed time at step 4: 0.10484170913696289062
Elapsed time at step 5: 0.10464382171630859375
Total elapsed time: 40.94383144378662109375
Process finished with exit code 0
while with CPU:
Initialized after: 0.325620174407959
Elapsed time at step 1: 0.36024403572082519531
Elapsed time at step 2: 0.36671090126037597656
Elapsed time at step 3: 0.38034081459045410156
Elapsed time at step 4: 0.38145875930786132812
Elapsed time at step 5: 0.38542723655700683594
Total elapsed time: 2.20009613037109375000
Now everything makes sense! Thanks for your time
@JacoGasp great, I'm glad that's helped! If you've any other questions, let us know.
I successfully compiled the amd_gpu branch of this repo and I'm finally able to run TensorFlow with OpenCL. The problem is that it is totally unusable because of the extremely long computing time, even for very basic operations.
Using the following code
it takes more than 40 seconds on a Radeon HD 7950. Here is the output:
while the same task takes on my laptop (CPU) less than 0.1s.
Specifying
with tf.device('/CPU:0'):
I obtain a more reasonable 0.17 s computation time. Using '/SYCL:0' gives the same result as not specifying the device at all (again 40+ seconds), while using '/GPU:0' produces the following error:

System information
computecpp_info
clinfo
EDIT: I'm using
export GPU_FORCE_64BIT_PTR=1
to make it run.