cifar10_train.py on AMD SoC GPU (Kalindi) is 4 times slower than its SoC CPU (Kabini)

enihcam commented 6 years ago

Please go to Stack Overflow for help and support:

https://stackoverflow.com/questions/tagged/tensorflow

If you open a GitHub issue, here is our policy:

It must be a bug, a feature request, or a significant problem with documentation (for small docs fixes please send a PR instead).
The form below must be filled out.
It shouldn't be a TensorBoard issue. Those go here.

Here's why we have that policy: TensorFlow developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Archlinux
TensorFlow installed from (source or binary): source

TensorFlow version (use command below):

$ python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'ComputeCpp-v0.6.0-30-g4cc789977d' 1.6.0-rc0

Python version: 3.6.5
Bazel version (if compiling from source): 0.12.0
GCC/Compiler version (if compiling from source): 7.3.1 20180406
CUDA/cuDNN version: N/A

GPU model and memory:

Device Name                                     Kalindi
Device Vendor                                   Advanced Micro Devices, Inc.
Device Vendor ID                                0x1002
Device Version                                  OpenCL 1.2 AMD-APP (2580.4)
Driver Version                                  2580.4
Device OpenCL C Version                         OpenCL C 1.2
Device Type                                     GPU
Device Board Name (AMD)                         AMD Radeon Graphics
Device Topology (AMD)                           PCI-E, 00:01.0
Device Profile                                  FULL_PROFILE
Device Available                                Yes
Compiler Available                              Yes
Linker Available                                Yes
Max compute units                               2
SIMD per compute unit (AMD)                     4
SIMD width (AMD)                                16
SIMD instruction width (AMD)                    1
Max clock frequency                             496MHz
Graphics IP (AMD)                               7.2
Device Partition                                (core)
Max number of sub-devices                     2
Supported partition types                     (n/a)
Supported affinity domains                    (n/a)
Max work item dimensions                        3
Max work item sizes                             1024x1024x1024
Max work group size                             256
Preferred work group size (AMD)                 256
Max work group size (AMD)                       1024
Preferred work group size multiple              64
Wavefront width (AMD)                           64

Exact command to reproduce: python ./models/tutorials/image/cifar10/cifar10_train.py You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

$ python ./models/tutorials/image/cifar10/cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2018-05-02 13:08:53.003386: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-02 13:08:53.003504: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Kalindi, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-05-02 13:03:32.883491: step 2560, loss = 1.38 (19.2 examples/sec; 6.683 sec/batch)
2018-05-02 13:04:39.774720: step 2570, loss = 1.34 (19.1 examples/sec; 6.689 sec/batch)
2018-05-02 13:05:46.625889: step 2580, loss = 1.43 (19.1 examples/sec; 6.685 sec/batch)

For CPU-based tensorflow, it was around ~80 examples/sec.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

Build configuration: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-computecpp

lukeiwanski commented 6 years ago

Hi @enihcam

Thanks for the report. I will do my best to help. To clarify is that this device: https://www.techpowerup.com/gpudb/2197/radeon-hd-8280e ?

As well, do you know if the device you are using has physical local memory or is it using global memory to simulate it?

enihcam commented 6 years ago

Thank you @lukeiwanski. Yes it is. Also the full processor name is AMD A4-5000 APU with Radeon(TM) HD Graphics.

Sorry, what do you mean 'global memory to simulate it'? Since it is an integrated GPU, it uses system RAM (DDR3) shared by processor.

DuncanMcBain commented 6 years ago

Hi @enihcam, While it is true that the performance on your hardware is low, we think there are a few factors contributing to this. The iGPU in your SoC is barely more powerful than the CPU, and we should therefore expect performance that (at best) would be on-par with the CPU. However, given the design we have taken at the moment (focussing on discrete GPUs with many CUs and high memory bandwidth), it is likely that the code as-is will not perform well on an AMD APU.

More specifically, it seems likely to me that there will be some redundant copies on APU hardware (since the memory is shared between the CPU and GPU). For these reasons, I don't think you will obtain good performance on this hardware, even if (as is likely) there are still optimisations we could make to our TensorFlow efforts.

mirh commented 6 years ago

Are you using latest opencl-amd?

More specifically, it seems likely to me that there will be some redundant copies on APU hardware

Putting aside whatever specific low end consideration now (his gpu should crunch just short of 150 Gflops btw).. shouldn't you look into zero copy then if that happens?

DuncanMcBain commented 6 years ago

That's certainly a possibility, but I don't imagine that this is an interesting optimisation target for us right now. That said I might be wrong - CodeXL might be able to provide some traces showing whether excessive time is being spent copying the buffers around.

enihcam commented 6 years ago

Thank you @mirh @DuncanMcBain

Yes, I'm using latest opencl-amd (ver 18.10.572953). How to enable zero-copy?

DuncanMcBain commented 6 years ago

It would be more instructive to be sure that this is the issue first than to delve into the guts when, indeed, this optimisation might already be in effect.

As I say, however, this hardware isn't currently an interesting target to us.

enihcam commented 6 years ago

Also I would like to know what is the performance of tensorflow-computecpp on intel GPU? is it also slower than CPU?

DuncanMcBain commented 6 years ago

I don't believe it is, though I don't have any numbers to hand at the moment (I don't have that hardware, and we don't test it internally, but I think we've done some ad-hoc tests).

lukeiwanski commented 6 years ago

@enihcam / @DuncanMcBain after Neo driver was released we have Skylake series SoC available for tests and benchmarks - there is nothing ad-hoc about this ;) @enihcam is there any particular benchmark / model you are interested in?

enihcam commented 6 years ago

@lukeiwanski yes, i'm going to install it on KabyLake (i5-7200U) :D

For AMD SoC, I'm wondering, is there any flags required to be turn on (or off) in kernel config? I ask this because all my linux boxes are using customized kernel. config.txt

mirh commented 6 years ago

As long as AMDGPU and AMDKFD are there, I don't think there's any other particular requirement for it to perform "properly". The thing is, computecpp might just be optimized for the "big dedicated gpu" scenario, rather than "tiny shared" one.

I'm not sure how much of ROCm or HSA Kabini supports, anyway many features should be already exposed via opencl. And if you care for it, as they told you, you should get aboard the profiling train.

EDIT: also of a fun fact, fglrx used to support 2.0 there once upon a time

enihcam commented 6 years ago

@mirh Aha! That explains why Kabini GPU is slow. It does NOT support HSA (i.e. AMDKFD)!!

mirh commented 6 years ago

Nothing of that is used at all here in the first place. Then, even though you are right Jaguar/Puma apus don't support HSA ~~(as for KFD, which is way more than just that, things may or may not improve in the future depending on how extensively AMD will be able to "backport" the thing~~) I was just suggesting some room for Fine-Grain SVM buffer optimizations. (which should also be more or less the same feature level of Intel Gen8 igps)

lukeiwanski / tensorflow