mozilla / DeepSpeech

DeepSpeech is an open-source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
Mozilla Public License 2.0

Investigate GPUs on ARM boards #1346

Open · lissyx opened 6 years ago

lissyx commented 6 years ago

We have ARM and ARM64 (#1305) support arriving for the RPi3B+ and LePotato boards. We should try and see if we can get GPU acceleration there.

edubois commented 6 years ago

The Rock64Pro-AI has a Mali-T860; I would consider this option.

lissyx commented 6 years ago

@edubois You're welcome to explore that, but I'll stick to what I have right now :)

edubois commented 6 years ago

Will do.

renepeinl commented 6 years ago

You state ARM boards but then mention Intel GPUs. Therefore, I would offer to do some testing on the UP board, which is powered by a quad-core Atom x5-Z8350, has 4 GB of RAM, and incorporates Cherry Trail HD Graphics with 12 execution units. We are using it since it is compatible with the Matrix Voice, our hardware for sound capture.

We are also looking at the Intel Movidius Neural Compute Stick, which is compatible with TensorFlow and the Raspberry Pi 3, since the combination would be more cost-effective.

The Rock64Pro-AI looks interesting as well and would be even cheaper. I'm curious about @edubois's results.

lissyx commented 6 years ago

@renepeinl I have actually been experimenting for quite some time with Intel GPUs on my laptop, debugging and checking performance (thanks to Codeplay and Intel people), so I know we can get it working with the "Neo" driver, which sadly is far from being released yet. The Compute Stick is useless in our case because of RNNs. The previous driver, Beignet, was a dead end: it does not work with ComputeCpp (the layer TensorFlow uses for OpenCL) and is no longer actively developed by Intel.

People who want to experiment should use the ccpp branch of our TensorFlow and DeepSpeech repos, but be aware it's a hack in progress :)
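
For reference, fetching those branches might look like the following (a minimal sketch, assuming the branch is literally named ccpp in both the mozilla/tensorflow and mozilla/DeepSpeech repos):

$ # assumption: branch named "ccpp" in both Mozilla repos
$ git clone --branch ccpp https://github.com/mozilla/tensorflow.git
$ git clone --branch ccpp https://github.com/mozilla/DeepSpeech.git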

renepeinl commented 6 years ago

Thanks for this information. Could you also provide some hints on compiler flags for building the software? I'm not sure how much influence they have, and we are mainly Java developers with no deep knowledge of C++.

lissyx commented 6 years ago

@renepeinl It should be pretty simple if you follow the docs we have in place and TensorFlow's building docs. For the ComputeCpp branch, you'll need to download the matching ComputeCpp version. Since it's a hack in progress, I have not documented that, but you can look at the tc-*.sh shell scripts in our TensorFlow repo; they should contain everything. Basically: Bazel v0.10.0, ComputeCpp 0.5.1 (I think?) and the proper ./configure flags (check tc-vars.sh mostly). On the DeepSpeech build side, nothing should change whether OpenCL is used or not.
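
To make that concrete, the TensorFlow side of such a build might look roughly like this (a sketch only; the authoritative variables and values live in tc-vars.sh, and the ComputeCpp install path here is an assumption):

$ # assumption: ComputeCpp 0.5.1 unpacked under /opt/ComputeCpp
$ cd tensorflow
$ export TF_NEED_OPENCL_SYCL=1
$ export COMPUTECPP_TOOLKIT_PATH=/opt/ComputeCpp
$ ./configure
$ bazel build -c opt //tensorflow:libtensorflow_cc.so    # with Bazel v0.10.0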

lissyx commented 6 years ago

@renepeinl If you run into issues, you can join us on IRC (#machinelearning on irc.mozilla.org) or on Discourse: https://discourse.mozilla.org/c/deep-speech

edubois commented 6 years ago

@renepeinl, the chip is not yet available; I will start when I get one, probably in September.

lissyx commented 6 years ago

Good first milestone on RPi3:

lissyx commented 6 years ago

$ sudo ComputeCpp-CE-0.7.0-Ubuntu-14.04-ARM_32/bin/computecpp_info --verbose --use-spirv 
********************************************************************************

ComputeCpp Info (CE 0.7.0)

********************************************************************************

Toolchain information:

GLIBC version: 2.24
GLIBCXX: 20150426
This version of libstdc++ is supported.

********************************************************************************

Device Info:

Discovered 1 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : NO - Device does not support SPIR
  CL_DEVICE_NAME                          : VideoCore IV GPU
  CL_DEVICE_VENDOR                        : Broadcom
  CL_DRIVER_VERSION                       : 0.4
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 
  CL_DEVICE_VERSION                       : OpenCL 1.2 VC4CL 0.4
  CL_DEVICE_PROFILE                       : EMBEDDED_PROFILE
  CL_DEVICE_MAX_COMPUTE_UNITS             : 1
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS      : 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES           : 12 / 12 / 12
  CL_DEVICE_MAX_WORK_GROUP_SIZE           : 12
  CL_DEVICE_MAX_CLOCK_FREQUENCY           : 300 MHz
  CL_DEVICE_ADDRESS_BITS                  : 32
  CL_DEVICE_HOST_UNIFIED_MEMORY           : YES
  CL_DEVICE_MAX_MEM_ALLOC_SIZE            : 76 MByte
  CL_DEVICE_GLOBAL_MEM_SIZE               : 76 MByte
  CL_DEVICE_ERROR_CORRECTION_SUPPORT      : NO
  CL_DEVICE_LOCAL_MEM_TYPE                : global
  CL_DEVICE_LOCAL_MEM_SIZE                : 77824 KByte
  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE      : 77824 KByte
  CL_DEVICE_QUEUE_PROPERTIES              : CL_QUEUE_PROFILING_ENABLE
  CL_DEVICE_IMAGE_SUPPORT                 : NO
  CL_DEVICE_MAX_READ_IMAGE_ARGS           : 64
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS          : 64
  CL_DEVICE_IMAGE2D_MAX_WIDTH             : 2048
  CL_DEVICE_IMAGE2D_MAX_HEIGHT            : 2048
  CL_DEVICE_IMAGE3D_MAX_WIDTH             : 2048
  CL_DEVICE_IMAGE3D_MAX_HEIGHT            : 2048
  CL_DEVICE_IMAGE3D_MAX_DEPTH             : 2048
  CL_DEVICE_PREFERRED_VECTOR_WIDTH        : CHAR 16 SHORT 16 INT 16 LONG 0 FLOAT 16 DOUBLE 0 
  CL_DEVICE_EXTENSIONS                    : cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.7.0/platform-support-notes

********************************************************************************

lissyx commented 6 years ago

Some samples from the TestVC4C testsuite:

$ sudo LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH ./TestVC4C
./example/fft2_2.cl
./example/fibonacci.cl
./example/fibonacci.spt
[W] Fri May 18 15:26:19 2018: Failed to remove empty basic block: label: %13
[W] Fri May 18 15:26:19 2018: Block has explicit predecessor: br %13
./example/hello_world.cl
./example/hello_world_vector.cl
./example/test.cl
[W] Fri May 18 15:26:27 2018: Warnings in precompilation:
[W] Fri May 18 15:26:27 2018: ./example/test.cl:27:6: warning: incompatible pointer to integer conversion initializing 'int' with an expression of type 'int *'; remove &
        int n = &i;
            ^   ~~
1 warning generated.

./example/test_instructions.cl
./example/test_prime.cl

lissyx commented 6 years ago

After fighting with TensorFlow's ComputeCpp branch to cross-compile for the RPi3, I got something built. It's running as of now; no idea yet what we can expect, both in terms of output and in terms of speed:

pi@rpi3-opencl-20180518:~/deepspeech $ sudo ./deepspeech ~/tmp/deepspeech/models/tf14.frozen.494_e120.LSTM.ldc93s1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t
TensorFlow: v1.8.0-rc1-1904-g9989353054
DeepSpeech: v0.2.0-alpha.5-0-g7cc8382
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-05-23 13:54:53.988212: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-23 13:54:53.988803: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: VideoCore IV GPU, vendor: Broadcom, profile: EMBEDDED_PROFILE
Running on directory /home/pi/tmp/deepspeech/audio/
> /home/pi/tmp/deepspeech/audio//2830-3980-0043.wav
2018-05-23 13:54:54.205643: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.227152: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.279121: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.300364: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.322776: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
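
As an aside, the "mmapped graph" warning above can be addressed with TensorFlow's convert_graphdef_memmapped_format tool. A minimal sketch (input name taken from the run above; the .pbmm output name is an assumption):

$ bazel build -c opt //tensorflow/contrib/util:convert_graphdef_memmapped_format
$ bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
    --in_graph=tf14.frozen.494_e120.LSTM.ldc93s1.pb \
    --out_graph=tf14.frozen.494_e120.LSTM.ldc93s1.pbmm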

lissyx commented 6 years ago

Current status: we are starting to seriously compile TensorFlow's SYCL kernels, and we are hitting some issues in the VC4CL driver :-)

lissyx commented 6 years ago

A lot of errors were fixed in VC4C and VC4CL; we're now hitting an issue with LLVM mangling of SPIR-V, and with a workaround in place there's a register allocation error after that.

lissyx commented 6 years ago

No further progress on that: lack of time.

McHughes288 commented 6 years ago

Apologies @lissyx for jumping in on this thread. I am trying to get the Raspberry Pi's GPU visible when running the computecpp_info script, but I'm having trouble. Which OS are you running on the Pi3? (Raspbian Stretch is still only 32-bit, so I'm guessing it won't work properly with ComputeCpp.)

lissyx commented 6 years ago

@McHughes288 I was running Stretch; an ARM32 build of ComputeCpp was available. FYI, it's still stalled: all the basics are covered, and it is now only blocked by the VC4CL driver being unable to digest our kernels. I'm off until September 5th; by then I'll be able to play again with the new, simpler model. Hopefully we'll see some breakthrough.

lissyx commented 5 years ago

@edubois @renepeinl So, I've ordered a ROCKPro64 :-)

edubois commented 5 years ago

Cool @lissyx, I'm still waiting for the AI version (RockPro-64-ai).

lissyx commented 5 years ago

The one I ordered is supposed to have an NPU with NNAPI support; is yours different?

edubois commented 5 years ago

I'm not sure how different they are, but there are two variants of the Rock64Pro: the Rock64Pro with a Rockchip RK3399, and the AI version with a Rockchip RK3399Pro. I might be wrong; tell me if you think I am.

lissyx commented 5 years ago

Whoops, I might have been misled by the ROCKPro64 naming :/

bkmgit commented 5 years ago

TF-coriander may be of interest as an OpenCL version of TensorFlow.

lissyx commented 5 years ago

Thanks, but we want to avoid forks, and it seems the most active codebase is the ComputeCpp one, yet it's quite outdated (1.8 last time; coriander is at 0.18, so it's even much older).

renepeinl commented 5 years ago

Speaking of outdated versions: the ccpp branch on GitHub looks quite outdated. Is it still the best starting point for getting GPU support working on Intel GPUs?

lissyx commented 5 years ago

@renepeinl Sorry, but this was published on a best-effort basis for those brave enough to play with it. Honestly, OpenCL support does not look like a huge priority for TensorFlow upstream, so we are focusing our efforts elsewhere.

lissyx commented 5 years ago

See also #2270 for some TFLite-related experiments / progress.

lissyx commented 5 years ago

FTR, with the RPi4 and the switch to the TFLite runtime, we exceed realtime with only one core at 100%, so the incentive to leverage the GPU on those boards is getting lower.

sbrl commented 5 years ago

Not everyone can afford to upgrade to an RPi 4 :confused:

Also, what if the GPU is being used for something else? And even if the CPU isn't maxed out, would using the GPU yield faster results anyway? Worth investigating, perhaps?

lissyx commented 5 years ago

> Not everyone can afford to upgrade to an RPi 4 :confused:

We use the same builds for RPi3 and RPi4, so the improvement will also benefit those users. It's just that on the RPi3 we can't provide faster-than-realtime performance.

> Also, what if the GPU is being used for something else?

I suspect you were referring to the CPU. Well, as I said, there was plenty of room left for other operations.

> Would using the GPU yield faster results anyway? Worth investigating, perhaps?

If you read the history, you will see that I already spent several weeks investigating OpenCL on those boards (and others), and that there are several roadblocks.

On the RPi3, the driver had so many limitations that it was far, far away from even being able to compile the model into something runnable.

So if you care about that, fund the development and/or push TensorFlow upstream to better support OpenCL.

lissyx commented 4 years ago

> Not everyone can afford to upgrade to an RPi 4 :confused:

So, at the time of our previous exchanges, the situation was that the huge matrix/vector multiplications in our model would not be parallelized, because they depend on floats. And turning our model to int using TFLite's quantization tooling was broken in several places.

Things have moved, and I could verify that TensorFlow r2.2 relies on the ruy library and that, if you compile it properly, you get a TFLite runtime that uses ruy for those matrix/vector multiplications. I could verify we now leverage 4 threads on several ARM targets, including an RPi3 running Raspbian.

There's still quite some plumbing work left, but we should have something available sooner rather than later now.
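
For the curious, the TFLite side of such a build amounts to something like this sketch (the tflite_with_ruy define is an assumption based on what shipped around r2.2; check the DeepSpeech build scripts for the authoritative invocation):

$ # assumption: the tflite_with_ruy define, as found around TensorFlow r2.2
$ bazel build -c opt --define=tflite_with_ruy=true //tensorflow/lite:libtensorflowlite.so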

lissyx commented 4 years ago

This revealed a nasty bug in the ruy library used by TensorFlow (only exposed on ARMv7 running NodeJS v11 and above), but now it should work as expected. This is the PR leveraging this effort: https://github.com/mozilla/DeepSpeech/pull/2952

With TensorFlow r2.2 we can also get GPU (and other) delegation, so any feedback would be welcome on https://github.com/mozilla/DeepSpeech/issues/2270

If people are brave enough to test and give some feedback, that would help assess the need and the improvements one can expect.