lukeiwanski / tensorflow

OpenCL support for TensorFlow via SYCL
Apache License 2.0

cifar10_train.py on AMD SoC GPU (Kalindi) is 4 times slower than its SoC CPU (Kabini) #239

Open enihcam opened 6 years ago

enihcam commented 6 years ago




Describe the problem


$ python ./models/tutorials/image/cifar10/cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2018-05-02 13:08:53.003386: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-02 13:08:53.003504: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Kalindi, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-05-02 13:03:32.883491: step 2560, loss = 1.38 (19.2 examples/sec; 6.683 sec/batch)
2018-05-02 13:04:39.774720: step 2570, loss = 1.34 (19.1 examples/sec; 6.689 sec/batch)
2018-05-02 13:05:46.625889: step 2580, loss = 1.43 (19.1 examples/sec; 6.685 sec/batch)

For the CPU-based TensorFlow build, throughput was around 80 examples/sec.
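For scale, the two throughput figures above work out to roughly the factor of four in the title (a quick arithmetic check, nothing more):

```python
# Rough slowdown check from the numbers in the log above:
# the SYCL/GPU run sustains ~19.1 examples/sec, while the
# CPU build reached ~80 examples/sec.
gpu_examples_per_sec = 19.1
cpu_examples_per_sec = 80.0

slowdown = cpu_examples_per_sec / gpu_examples_per_sec
print(f"GPU is {slowdown:.1f}x slower than CPU")  # GPU is 4.2x slower than CPU
```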

Source code / logs


Build configuration: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-computecpp

lukeiwanski commented 6 years ago

Hi @enihcam

Thanks for the report; I will do my best to help. To clarify: is this the device: https://www.techpowerup.com/gpudb/2197/radeon-hd-8280e ?

Also, do you know whether the device you are using has physical local memory, or whether it uses global memory to simulate it?

enihcam commented 6 years ago

Thank you @lukeiwanski. Yes, it is. Also, the full processor name is AMD A4-5000 APU with Radeon(TM) HD Graphics.

Sorry, what do you mean by 'global memory to simulate it'? Since it is an integrated GPU, it uses system RAM (DDR3) shared with the processor.

DuncanMcBain commented 6 years ago

Hi @enihcam. While it is true that performance on your hardware is low, we think a few factors contribute to this. The iGPU in your SoC is barely more powerful than the CPU, so at best we should expect performance on par with the CPU. Moreover, given the design approach we have taken so far (focusing on discrete GPUs with many CUs and high memory bandwidth), the code as-is is unlikely to perform well on an AMD APU.

More specifically, it seems likely to me that there will be some redundant copies on APU hardware (since the memory is shared between the CPU and GPU). For these reasons, I don't think you will obtain good performance on this hardware, even if (as is likely) there are still optimisations we could make to our TensorFlow efforts.

mirh commented 6 years ago

Are you using latest opencl-amd?

More specifically, it seems likely to me that there will be some redundant copies on APU hardware

Putting aside any specific low-end considerations for now (his GPU should crunch just short of 150 GFLOPS, btw)... shouldn't you look into zero-copy, then, if that is what happens?
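For reference, mirh's GFLOPS figure is in the right ballpark for this part (a sketch; the 128-shader count and ~500 MHz clock are assumptions, as published clocks for Kalindi variants vary):

```python
# Peak FP32 estimate for the Kabini iGPU (GCN "Kalindi"):
shaders = 128
clock_hz = 500e6      # assumed; published clocks for this part vary
flops_per_cycle = 2   # fused multiply-add = 2 FLOPs per shader per cycle

peak_gflops = shaders * clock_hz * flops_per_cycle / 1e9
print(f"~{peak_gflops:.0f} GFLOPS peak FP32")  # ~128 GFLOPS peak FP32
```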

DuncanMcBain commented 6 years ago

That's certainly a possibility, but I don't imagine this is an interesting optimisation target for us right now. That said, I might be wrong - CodeXL might be able to provide traces showing whether excessive time is being spent copying buffers around.

enihcam commented 6 years ago

Thank you @mirh @DuncanMcBain

Yes, I'm using the latest opencl-amd (ver 18.10.572953). How do I enable zero-copy?

DuncanMcBain commented 6 years ago

It would be more instructive to confirm that this is the issue first, rather than delving into the guts when this optimisation might already be in effect.

As I say, however, this hardware isn't currently an interesting target for us.

enihcam commented 6 years ago

Also, I would like to know: what is the performance of tensorflow-computecpp on an Intel GPU? Is it also slower than the CPU?

DuncanMcBain commented 6 years ago

I don't believe it is, though I don't have any numbers to hand at the moment (I don't have that hardware and we don't test it internally, but I think we've done some ad-hoc tests).

lukeiwanski commented 6 years ago

@enihcam / @DuncanMcBain: since the Neo driver was released, we have Skylake-series SoCs available for tests and benchmarks - there is nothing ad-hoc about this ;) @enihcam, is there any particular benchmark / model you are interested in?

enihcam commented 6 years ago

@lukeiwanski yes, I'm going to install it on Kaby Lake (i5-7200U) :D

For the AMD SoC, I'm wondering: are there any flags that need to be turned on (or off) in the kernel config? I ask because all my Linux boxes use customised kernels. config.txt

mirh commented 6 years ago

As long as AMDGPU and AMDKFD are there, I don't think there's any other particular requirement for it to perform "properly". The thing is, ComputeCpp might just be optimised for the "big dedicated GPU" scenario rather than the "tiny shared" one.

I'm not sure how much of ROCm or HSA Kabini supports; in any case, many features should already be exposed via OpenCL. And if you care about it, as they told you, you should get aboard the profiling train.

EDIT: also, as a fun fact, fglrx used to support OpenCL 2.0 there once upon a time.
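One quick way to check a custom kernel for the flags mirh mentions is to scan the config file (config.txt here, or /proc/config.gz / /boot/config-$(uname -r)) for the AMDGPU and AMDKFD options. A minimal sketch, using an inline sample config in place of the real file (the option names are the mainline kernel's; CONFIG_HSA_AMD is the amdkfd driver):

```python
# Scan kernel-config text for the AMDGPU/AMDKFD options.
sample_config = """\
CONFIG_DRM_AMDGPU=m
# CONFIG_HSA_AMD is not set
"""

wanted = ("CONFIG_DRM_AMDGPU", "CONFIG_HSA_AMD")
for line in sample_config.splitlines():
    for opt in wanted:
        if opt in line:
            # built-in ("y") or module ("m") counts as enabled
            enabled = line.startswith(opt) and line.split("=", 1)[-1] in ("y", "m")
            print(f"{opt}: {'enabled' if enabled else 'NOT set'}")
# CONFIG_DRM_AMDGPU: enabled
# CONFIG_HSA_AMD: NOT set
```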

enihcam commented 6 years ago

@mirh Aha! That explains why the Kabini GPU is slow. It does NOT support HSA (i.e. AMDKFD)!!

mirh commented 6 years ago

None of that is used here in the first place. And even though you are right that Jaguar/Puma APUs don't support HSA (as for KFD, which is much more than just that, things may or may not improve in the future depending on how extensively AMD manages to "backport" it), I was just suggesting there is some room for fine-grain SVM buffer optimisations (which should also be more or less the same feature level as Intel Gen8 iGPUs).