enihcam opened 6 years ago
Hi @enihcam
Thanks for the report; I will do my best to help. To clarify: is this the device? https://www.techpowerup.com/gpudb/2197/radeon-hd-8280e
Also, do you know whether the device you are using has physical local memory, or whether it uses global memory to simulate it?
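If it helps, the standard OpenCL device queries can answer that directly; below is a minimal sketch (illustrative only - `clinfo` reports the same fields):

```c
/* Minimal sketch: query whether an OpenCL GPU has dedicated local
 * memory or emulates it in global memory, and whether host and
 * device share physical memory. Error checking omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_device_local_mem_type lmem;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                    sizeof(lmem), &lmem, NULL);
    /* CL_LOCAL: physically dedicated; CL_GLOBAL: simulated in global memory */
    printf("local memory: %s\n", lmem == CL_LOCAL ? "dedicated" : "simulated");

    cl_bool unified;
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    printf("host unified memory: %s\n", unified ? "yes" : "no");
    return 0;
}
```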
Thank you @lukeiwanski.
Yes it is. Also, the full processor name is "AMD A4-5000 APU with Radeon(TM) HD Graphics".
Sorry, what do you mean by 'global memory to simulate it'? Since it is an integrated GPU, it uses system RAM (DDR3) shared with the processor.
Hi @enihcam, While it is true that the performance on your hardware is low, we think there are a few factors contributing to this. The iGPU in your SoC is barely more powerful than the CPU, so we should expect performance that is (at best) on par with the CPU. However, given the design approach we have taken so far (focussing on discrete GPUs with many CUs and high memory bandwidth), it is likely that the code as-is will not perform well on an AMD APU.
More specifically, it seems likely to me that there will be some redundant copies on APU hardware (since the memory is shared between the CPU and GPU). For these reasons, I don't think you will obtain good performance on this hardware, even if (as is likely) there are still optimisations we could make to our TensorFlow efforts.
Are you using the latest opencl-amd?
> More specifically, it seems likely to me that there will be some redundant copies on APU hardware
Putting aside the specific low-end considerations for now (his GPU should crunch just short of 150 GFLOPS, by the way)... shouldn't you look into zero-copy, then, if that is what is happening?
That's certainly a possibility, but I don't imagine that this is an interesting optimisation target for us right now. That said, I might be wrong - CodeXL might be able to provide some traces showing whether excessive time is being spent copying the buffers around.
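Absent CodeXL, plain OpenCL event profiling gives a rough version of the same signal. A minimal standalone sketch (illustrative only, not taken from the TensorFlow/ComputeCpp code):

```c
/* Illustrative sketch: time a host->device buffer write with standard
 * OpenCL event profiling. If transfers like this dominate a trace,
 * redundant copies are a likely culprit. Error checking omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* CL_QUEUE_PROFILING_ENABLE is required for the timestamps below. */
    cl_command_queue q = clCreateCommandQueue(ctx, device,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);

    const size_t bytes = 64 * 1024 * 1024;
    void *host = malloc(bytes);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, &ev);

    cl_ulong start, end; /* nanoseconds */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("copy took %.3f ms\n", (end - start) * 1e-6);

    clReleaseMemObject(buf);
    free(host);
    return 0;
}
```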
Thank you @mirh @DuncanMcBain
Yes, I'm using the latest opencl-amd (ver. 18.10.572953). How do I enable zero-copy?
It would be more instructive to first be sure that this is actually the issue, rather than delving into the guts when this optimisation might already be in effect.
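For reference, zero-copy is not usually a single switch you flip; on most OpenCL implementations it falls out of how a buffer is allocated and accessed. A minimal sketch of the usual pattern (illustrative only, not taken from our codebase):

```c
/* Illustrative sketch of the common OpenCL zero-copy pattern:
 * allocate with CL_MEM_ALLOC_HOST_PTR and access the buffer through
 * map/unmap instead of explicit read/write copies. On shared-memory
 * devices the map can resolve to a pointer into the same physical
 * RAM, avoiding a copy (driver-dependent). Error checking omitted. */
#include <string.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    const size_t bytes = 1024 * 1024;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, NULL);

    /* Map for host writing; no clEnqueueWriteBuffer copy needed. */
    void *ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                   0, NULL, NULL, NULL);
    memset(ptr, 0, bytes);
    clEnqueueUnmapMemObject(q, buf, ptr, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf);
    return 0;
}
```

Whether the map actually avoids a copy is up to the driver, which is why confirming with a trace first is the right call.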
As I say, however, this hardware isn't currently an interesting target for us.
Also, I would like to know: what is the performance of tensorflow-computecpp on an Intel GPU? Is it also slower than the CPU?
I don't believe it is, though I don't have any numbers to hand at the moment (I don't have that hardware, and we don't test it internally, but I think we've done some ad-hoc tests).
@enihcam / @DuncanMcBain after the Neo driver was released we have Skylake-series SoCs available for tests and benchmarks - there is nothing ad-hoc about this ;) @enihcam is there any particular benchmark / model you are interested in?
@lukeiwanski yes, I'm going to install it on Kaby Lake (i5-7200U) :D
For the AMD SoC, I'm wondering: are there any flags that need to be turned on (or off) in the kernel config? I ask because all my Linux boxes run customized kernels. config.txt
As long as AMDGPU and AMDKFD are there, I don't think there's any other particular requirement for it to perform "properly". The thing is, ComputeCpp might just be optimized for the "big dedicated GPU" scenario rather than the "tiny shared" one.
I'm not sure how much of ROCm or HSA Kabini supports; in any case, many features should already be exposed via OpenCL. And if you care about this, then, as they told you, you should get aboard the profiling train.
EDIT: also, fun fact: fglrx used to support OpenCL 2.0 there, once upon a time.
@mirh Aha! That explains why the Kabini GPU is slow. It does NOT support HSA (i.e. AMDKFD)!!
None of that is used here in the first place.
Then, even though you are right that Jaguar/Puma APUs don't support HSA (as for KFD, which is much more than just that, things may or may not improve in the future depending on how extensively AMD is able to "backport" it), I was just suggesting some room for Fine-Grain SVM buffer optimizations.
(That should also be more or less the same feature level as Intel Gen8 iGPUs.)
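For reference, an OpenCL 2.0 fine-grain SVM buffer looks roughly like the sketch below - illustrative only, and the device must actually report the capability (which, per the above, Kabini may not):

```c
/* Sketch of an OpenCL 2.0 fine-grain SVM buffer: the same pointer is
 * valid on host and device, with coherency handled by the runtime,
 * so no explicit map/unmap or copy is needed. Error checking omitted. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    cl_device_svm_capabilities caps;
    clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES,
                    sizeof(caps), &caps, NULL);
    if (!(caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER)) {
        printf("fine-grain SVM not supported on this device\n");
        return 1;
    }

    float *data = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                             1024 * sizeof(float), 0);
    data[0] = 1.0f; /* host writes directly; a kernel could read the same pointer */
    clSVMFree(ctx, data);
    return 0;
}
```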
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
TensorFlow installed from (source or binary): source
TensorFlow version (use command below):
Python version: 3.6.5
Bazel version (if compiling from source): 0.12.0
GCC/Compiler version (if compiling from source): 7.3.1 20180406
CUDA/cuDNN version: N/A
GPU model and memory:
Exact command to reproduce:
python ./models/tutorials/image/cifar10/cifar10_train.py
You can collect some of this information using our environment capture script: https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the problem
For CPU-based TensorFlow, it was around 80 examples/sec.
Source code / logs
Build configuration: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-computecpp