mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Building native_client for ARM64: NVIDIA Jetson (with CUDA) #761

Closed · elpimous closed this 6 years ago

elpimous commented 7 years ago

From #752:

@elpimous here: https://gist.github.com/lissyx/c008b43fd808d132989ec4d238d664fb

You'll have to change DEEPSPEECH_ROOT to your DeepSpeech root, and maybe create a Debian ARMv8 Sid tree under DEEPSPEECH_ROOT/multistrap-debian_arm64-sid/ using multistrap: https://gist.github.com/lissyx/f007de2fff24b5af219ce07d283ebe0f

It will involve some hacking: you need to pre-populate the Debian keyring files into multistrap-debian_arm64-sid/etc/apt/trusted.gpg.d/, and you will also have to do some magic with qemu-static for ARM64. Otherwise, when you multistrap, your system will be left half-configured, a lot of packages will not be properly set up, and nothing will work:
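(A minimal sketch of that multistrap + qemu-static dance, assuming the multistrap and qemu-user-static packages are installed; the package list here is illustrative, not the exact one from the gist.)

cat > multistrap-arm64.conf <<'EOF'
[General]
arch=arm64
directory=multistrap-debian_arm64-sid/
aptsources=Debian
bootstrap=Debian
cleanup=true
unpack=true

[Debian]
packages=libc6 libc6-dev libstdc++6
source=http://deb.debian.org/debian
keyring=debian-archive-keyring
suite=sid
EOF
sudo multistrap -f multistrap-arm64.conf

# Let the half-configured chroot finish setting itself up under qemu user emulation.
sudo cp /usr/bin/qemu-aarch64-static multistrap-debian_arm64-sid/usr/bin/
sudo chroot multistrap-debian_arm64-sid/ dpkg --configure -a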

elpimous commented 7 years ago

If anyone could help with this process: I'm working on Ubuntu 16.04.

I changed DEEPSPEECH_ROOT to my directory (/home/nvidia/DeepSpeech), created a folder "multistrap-debian_arm64-sid" inside it, and inside that, "usr/include" and "usr/lib" folders. I don't know multistrap or the patches, nor the second part of the post. Thanks, guys.

lissyx commented 7 years ago

The patches I shared are for cross-compilation. You need to look for documentation on how to build a Debian chroot using multistrap.

elpimous commented 7 years ago

Thanks Lissyx, I'll have a look at this.

gvoysey commented 7 years ago

For what it's worth, I have just had to do this. I threw in the towel on a cross-compile and just did it all on the TX1. @elpimous you'll need to use an 8 GB swapfile (a setup sketch is at the end of this comment), otherwise the build runs out of memory; compilation is otherwise straightforward. My compilation chain was:

N.B., the latest version of bazel will crash -- you should use 0.5.2, not 0.5.3. I followed this guide for the most part, transitioning over to the DeepSpeech docs when it was time to compile. This utils repo has some good utilities to help you here.

If @lissyx, @kdavis-mozilla, or other real project maintainers are OK with it, I am happy to contribute my binaries if you give me somewhere to put them.
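(As promised above, a minimal sketch of creating the 8 GB swapfile; /mnt/swapfile is an arbitrary path, and fallocate needs filesystem support, otherwise fall back to dd.)

sudo fallocate -l 8G /mnt/swapfile   # or: sudo dd if=/dev/zero of=/mnt/swapfile bs=1M count=8192
sudo chmod 600 /mnt/swapfile
sudo mkswap /mnt/swapfile
sudo swapon /mnt/swapfile
free -h                              # confirm the extra swap shows up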

elpimous commented 7 years ago

@gvoysey Very good!!!

Well, on my TX2, everything works nicely. (Thanks again, Lissyx, for the patience!!) Model creation works nicely with the GPU (4 times faster than my Core M5 7200U). For TX2 owners: 40 epochs on about 1h30 of WAVs, with 5 min of test and 10 min of validation data (nearly 1,650 WAVs) = 1h14. Super... for a small 15 W board!

But, as Lissyx told me, native_client inference only works on the CPU (a shame for such a nice small GPU board!).

Decoding an utterance takes nearly 0.90 s (surely it could be divided by 6 with GPU usage).

Did you have any success installing native_client with GPU support?

gvoysey commented 7 years ago

I built TensorFlow with CUDA support, but I do not know if that lets the native client use the GPU on ARMv8. @lissyx: can you provide any clarification on this?

elpimous commented 7 years ago

@gvoysey If you have to reinstall TensorFlow on the Jetson, the JetsonHacks page has an automated install process: https://github.com/jetsonhacks/installTensorFlowTX2 My version is 1.2.

lissyx commented 7 years ago

First off, we really have no support for ARM with GPU, and I don't have any ARM board with CUDA support. If TensorFlow supports it (I have not investigated), it should work. ARMv8 by itself is not a problem; I have been cross-compiling for it successfully, targeting an RPi3 with a Debian Sid system. I'm interested, though, in any success or failure you have testing that!

lissyx commented 7 years ago

Regarding contributing binaries, I really don't understand why. We already provide ARMv6 binaries that are tested to work on the RPi3. Do you want to share ARMv8 binaries? I would not recommend that; it would be better to have proper support that we can cross-compile on TaskCluster. My testing has not revealed any speed improvement from ARMv8; our ARMv6 binaries cross-built on TaskCluster already use the proper NEON FPU, and that was measured to really speed things up.

gvoysey commented 7 years ago

@lissyx I'll confirm with nvidia-smi whether the GPU is being used on the TX1 and report back.
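(A hedged aside: Jetson/Tegra boards usually do not ship nvidia-smi; tegrastats is the usual way to watch GPU load there, and its location varies by L4T release.)

sudo ~/tegrastats    # older L4T releases put it in the ubuntu user's home directory
sudo tegrastats      # newer JetPack releases install it on the PATH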

My offer to contribute binaries was based on two things:

  1. the ARMv6 binaries I downloaded don't run (bash reports "file not found" or "invalid binary" when you try to execute them);
  2. I needed Python 3.6 wheels for the bindings, not 2.7.
lissyx commented 7 years ago

Thanks @gvoysey. I guess that (1) is simply because you run an ARMv8 system, which would explain it. I thought one could rely on Debian multiarch support on ARM as well, i.e., running ARMv6 binaries on an ARMv8 system, provided you have the proper dependencies installed. I would suggest you verify this first, because I don't really know how much we want to support ARMv8 right now. At least, all the resources I can find document that it should work.

(2) For now, I have not found a proper way to perform ARM cross-compilation of Python code; the best I could find are projects relying on Docker, which might not work very well with TaskCluster. We are only assuming, for now, that we target ARM systems with the C++ libs only, not with the Python or Node bindings.
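(A minimal sketch of enabling that multiarch support on a Debian/Ubuntu arm64 system; the package list is a guess at the usual runtime dependencies, so check what ldd reports as missing.)

sudo dpkg --add-architecture armhf
sudo apt-get update
sudo apt-get install libc6:armhf libstdc++6:armhf libgcc1:armhf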

In any case, while your offer to contribute binaries is welcome, I don't think we should do it that way: relying on on-device compilation is really not something I like. I bet it took you quite some time to build all of that :).

So, overall, for (1), I'd be happy to help if you want to re-use my ARMv8 patches and we land them; this might even be upstreamed, if proven to be of any interest. IMHO, if you can get the ARMv6 binaries working with Debian's multiarch support, I'm not sold on the real need for native ARMv8 performance: I saw no improvements, and while it is not that hard to add support for it, committing to maintaining it is another story :). And for (2), we need to discuss whether we want the Python and/or NodeJS bindings to work on ARM.

We can do (1) without doing (2) and vice-versa :)

elpimous commented 7 years ago

Hi all. On a 30 MB model, inference on a 3 s WAV takes nearly 2 s to process! I hope to run inference on the GPU... (sorry, I can't do it myself; my knowledge is too limited!)

lissyx commented 7 years ago

I am closing this as WONTFIX for now, since we have no short- or mid-term plans to support ARMv8 hardware. However, for posterity, @elpimous has been able to build for ARMv8 with CUDA support.

He has also been able to successfully get the ARMv6 binaries to work on ARMv8 (Ubuntu for Tegra) through multilib:

Then one just has to download and extract the ARMv6 native_client.tar.xz, and the following should work:

$ LD_LIBRARY_PATH=$(pwd):/lib/arm-linux-gnueabihf:/usr/lib/arm-linux-gnueabihf:$LD_LIBRARY_PATH ./deepspeech
elpimous commented 7 years ago

Thanks again Lissyx.

Benchmarks:

Unfinished model: 30 MB, n_hidden 494. Sentence to recognize: "as tu une intelligence artificielle supérieure a celle des humains" ("do you have an artificial intelligence superior to that of humans"; 250 K/s, mono, 16 kHz, 5 s WAV).

ARMv6:

time LD_LIBRARY_PATH=$(pwd):/lib/arm-linux-gnueabihf/:/usr/lib/arm-linux-gnueabihf/:$LD_LIBRARY_PATH ./deepspeech /home/nvidia/Documents/elpimous_utils/output_graph.pb /home/nvidia/Documents/elpimous_utils/record.46.wav /home/nvidia/DeepSpeech/data/alphabet.txt -t
as tu une intelligence artificielle sup'rieure axele des humains
real 0m4.459s
user 0m7.752s
sys 0m0.380s

ARMv8:

time ./deepspeech /home/nvidia/Documents/elpimous_utils/output_graph.pb /home/nvidia/Documents/elpimous_utils/record.46.wav /home/nvidia/DeepSpeech/data/alphabet.txt -t
as tu une intelligence artificielle sup'rieure axele des humains
real 0m4.698s
user 0m8.008s
sys 0m0.380s

ARMv8+CUDA:

2017-09-01 19:21:05.257405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-09-01 19:21:05.257774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GP10B
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 2.52GiB
2017-09-01 19:21:05.257834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-09-01 19:21:05.257893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-09-01 19:21:05.257944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GP10B, pci bus id: 0000:00:00.0)
as tu une intelligence artificielle sup'rieure axele des humains
real 0m3.501s
user 0m2.516s
sys 0m1.272s

Enjoy. Vincent

gvoysey commented 7 years ago

@elpimous Nice, congrats! ... Is there any way you could share your compiled native client with me?

lissyx commented 7 years ago

Hello @gvoysey, you should first give the multilib route with our pre-built binaries a try; it is more future-proof for you, and as @elpimous documented, there is no performance impact. Unless you absolutely need ARMv8 binaries, that would be the safest :)

gvoysey commented 7 years ago

Hi @lissyx, agreed on your points. I missed the upthread multilib comment.

Running on an NVIDIA TX1 with CUDA 8 and libcudnn 5:

LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/media/ubuntu/SD_NEU_00/gvoysey/native-client:/lib/arm-linux-gnueabihf/:/usr/lib/arm-linux-gnueabihf:/usr/lib/aarch64-linux-gnu ./deepspeech "/path/to/output_graph.pb" "../foo.wav" -t

exits with Segmentation fault immediately.

I'm using the latest build. I tried the previous build on taskcluster, which hung for 90 seconds, then segfaulted.

I'm not sure what's going on, honestly.
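(A generic triage sketch for a segfaulting binary, not specific to this crash: first check that the dynamic linker resolves every library, then grab a backtrace.)

ldd ./deepspeech    # any "not found" line means a missing shared library
gdb --args ./deepspeech /path/to/output_graph.pb ../foo.wav -t
# inside gdb: type "run", then "bt" when it crashes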

lissyx commented 7 years ago

@gvoysey The first obvious thing is that you probably seem to be lacking the alphabet :)

gvoysey commented 7 years ago

@lissyx I was indeed missing alphabet.txt, but now I am not. Still segfaulting, however:

ubuntu@tegra-ubuntu:/media/ubuntu/SD_NEU_00/gvoysey/native-client$ ll
total 113916
drwxrwxr-x 2 ubuntu ubuntu     4096 Sep  5 13:11 ./
drwxrwxr-x 7 ubuntu ubuntu     4096 Sep  5 12:21 ../
-rw-rw-r-- 1 ubuntu ubuntu      329 Sep  1 15:10 alphabet.txt
-rwxr-xr-x 1 ubuntu ubuntu    12160 Sep  1 15:10 deepspeech*
-r-xr-xr-x 1 ubuntu ubuntu    38652 Sep  1 15:08 libdeepspeech.so*
-r-xr-xr-x 1 ubuntu ubuntu    27880 Sep  1 15:04 libdeepspeech_utils.so*
-r-xr-xr-x 1 ubuntu ubuntu 96349276 Aug 24 09:26 libtensorflow_cc.so*
lissyx commented 7 years ago

@gvoysey You need to pass alphabet.txt as a parameter :/

lissyx commented 7 years ago

BTW, if you want to use CUDA on ARM, then you cannot use the ARMv6 binaries we provide: those don't have CUDA support enabled.

gvoysey commented 7 years ago

:sweat_smile: so I do:

./deepspeech /path/to/output_graph.pb path/to/go-to-the-target.wav ./alphabet.txt -t
cpu_time_overall=19.78046 cpu_time_mfcc=0.00681 cpu_time_infer=19.77365
time ./deepspeech /path/to/output_graph.pb path/to/go-to-the-target.wav ./alphabet.txt -t
real    0m12.366s
user    0m19.300s
sys 0m1.760s

This works -- but it takes ~19 seconds to transcribe a 1.62-second WAV file. That isn't comparable to the benchmarks posted by @elpimous, who was able to use CUDA, which is why I was pestering them about how they got it to work.

elpimous commented 7 years ago

@gvoysey Hi. First, did you run sudo ./jetson_clocks.sh and sudo nvpmodel -m 0?

Well, as Lissyx said previously, we (Lissyx and I) had success with multilib on the TX2 (meaning we use the native_client binaries directly!)... but without CUDA! My tests used the specs mentioned above: a 5 s mono 16,000 Hz WAV, a 30 MB model, and a very limited LM (Context3). Curious: 16 s for a simple 1.6 s file??? (Perhaps a big LM? Mine was only 10 sentences!)

Now, for the TX2, ARMv8 + CUDA:

TensorFlow: did you follow the JetsonHacks posts for the TensorFlow TX2 install? https://github.com/jetsonhacks/installTensorFlowTX2 You download some scripts (I changed each reference to TensorFlow v1.0.1 into v1.2.0, even in the wheels). If done correctly, TensorFlow's ./configure does all the work. Just press Enter at the Python prompt; once everything has finished you validate the TF install, and you'll see CUDA working. Good.
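(A quick sanity check that the installed wheel actually sees the GPU; this uses the standard TF 1.x device_lib API.)

python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
# a working CUDA build lists a /gpu:0 device alongside the CPU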

Mozilla DeepSpeech:

Copy the DeepSpeech directory into /home (tensorflow and DeepSpeech will both live in /home), then:

cd tensorflow
ln -s ../DeepSpeech/native_client ./

bazel build -c opt --copt=-march=native --copt=-mtune=native --config=cuda --copt=-O3 //tensorflow:libtensorflow_cc.so //native_client:*

cd ../DeepSpeech/native_client
make deepspeech
PREFIX=/usr/local sudo make install    # installs into your ld library path

make bindings
sudo pip install dist/deepspeech*

This is exactly what I did to get ARMv8 + CUDA working! Test this link to be sure you install TF correctly on the TX2: https://github.com/jetsonhacks/installTensorFlowTX2

Tell me if it worked! Vincent

lissyx commented 7 years ago

@gvoysey What is the size of your trained model? The out-of-the-box DeepSpeech model would be consistent with the timing you report. Please keep in mind that @elpimous is working on a specific case, with a simpler model.

gvoysey commented 7 years ago

@lissyx it's trained on TED. output_graph.pb is ~ 400 MB. I can train a simpler model with significantly worse performance that's ~27 MB.

lissyx commented 7 years ago

So, TED, 400MB, I'm not surprised. You might get a nice speedup at the expense of having to rebuild libtensorflow_cc.so and libdeepspeech*.so for ARMv8+CUDA. The good news is that we know this is doable and should work, it's just a matter of CPU time :)

elpimous commented 7 years ago

@gvoysey: we have nearly the same model size... You should be able to approach my results.

YerongLi commented 6 years ago

@elpimous Have you figured out what's happening? I hit a similar issue here: https://devtalk.nvidia.com/default/topic/1025204/?comment=5229635 . And I know that on a successfully installed TensorFlow on the TX2 with Python 2.7 there is no NUMA error.

elpimous commented 6 years ago

Hi YerongLi. Are you sure there is no NUMA warning on a TF install on the TX2? I installed TF 1.1, 1.2, 1.3, and 1.4, and each time I got this warning! It seems to be a missing kernel component or a missing option for our ARMv8 processors. You are on a TX1: I hope you set up at least a 10 GB swap disk...

I'll look on the web for that.

YerongLi commented 6 years ago

Yes, I suspect it's a memory issue. Every time I run the test script https://github.com/jetsonhacks/installTensorFlowJetsonTX/blob/master/TX1/tensorflow-1.3.0-cp35-cp35m-linux_aarch64.whl it still gives correct output. But the problem is that TensorFlow is extremely slow and the whole system becomes very slow (actually it gets stuck). Does your system become slow every time you see the NUMA warnings?

Here is the output from the script above:

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/tmpvzmvs0m5
2017-12-30 06:22:48.342137: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-30 06:22:48.342261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.9984
pciBusID 0000:00:00.0
Total memory: 3.89GiB
Free memory: 1.19GiB
2017-12-30 06:22:48.342308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-30 06:22:48.342338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-30 06:22:48.342369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
step 0, training accuracy 0.06
YerongLi commented 6 years ago

I know a person working on a TX2; strictly speaking, he may be using some old TensorFlow on the original R28 from JetPack 3.1, and he never sees the NUMA warning. I am sure about that...

elpimous commented 6 years ago

Hi. First, about NUMA:

Answer from an NVIDIA moderator:

NUMA is for multi-GPU (you just have one GPU, named "0"). It looks like your model wants to enable multi-GPU options, but NUMA is turned off by default when building, and the TX2 only has one GPU.

It seems you can clear the warning by telling TF to work with only one GPU! I prefer to keep the warning, lol.
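(A hedged sketch of that "tell TF to use only one GPU" idea: pinning via the standard CUDA_VISIBLE_DEVICES variable; whether this clears this particular warning is untested here, and train.py is a placeholder name.)

CUDA_VISIBLE_DEVICES=0 python train.py    # train.py stands in for your training script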

Now, it seems that you are training?! On what dataset? Look at your free RAM (which is shared between the CPU and the GPU!!): 1.19 GB!!! Not enough RAM -> swap disk -> lots of disk read/write access...

You're trying a job that needs a lot of resources, and you are seeing your limits... but you'll succeed!! Just let your nice board work at its own rhythm!

PS: My TX2 is a bit more comfortable (8 GB RAM), but rest assured that I frequently have to live with the limits and warnings (OOMs).

YerongLi commented 6 years ago

@elpimous Thank you so much! You are right. After rebuilding the TX1 kernel and adding a 10 GB swapfile, TensorFlow works fine.

elpimous commented 6 years ago

Cool, YerongLi. Enjoy this fabulous world, and happy new year.

dedoogong commented 6 years ago

Hello @elpimous! I'm trying to run DeepSpeech on a TX2 with JetPack 3.2 (CUDA 9.0, cuDNN 7.0).

I succeeded in building TF 1.4.0 with warp-ctc and CUDA support (I got tensorflow_warpctc-1.4.0-cp27-cp27mu-linux_aarch64.whl),

but I got a double-conversion error when I tried building native_client v0.1.1:

bazel build -c opt --verbose_failures --copt=-march=native --copt=-mtune=native --config=cuda --copt=-O3 //native_client:deepspeech //native_client:deepspeech_utils //native_client:generate_trie

I installed the ARM multilib as you mentioned above, so I can run the prebuilt DeepSpeech for ARMv6.

Can you tell me which version of DeepSpeech you built? Or can you provide the prebuilt one? Or should I set DEEPSPEECH_ROOT?

I need to speed up DeepSpeech! Please help me~

Cheers, Seunghyun.

elpimous commented 6 years ago

dedoogong.

Hi. Try this: edit native_client/kenlm/util/double-conversion/utils.h in your DeepSpeech tree.

Add defined(__aarch64__) || to the architecture check, so it looks something like this:

#if defined(_M_X64) || defined(__x86_64__) || \
    defined(__ARMEL__) || defined(__avr32__) || \
    defined(__hppa__) || defined(__ia64__) || \
    defined(__mips__) || defined(__powerpc__) || \
    defined(__sparc__) || defined(__sparc) || defined(__s390__) || \
    defined(__SH4__) || defined(__alpha__) || defined(__aarch64__) || \

Enjoy

lissyx commented 6 years ago

@dedoogong We should have cross-compilers for ARM64 with GCC 4.9 and 7.2 soon; that might make it easier for you if you want to cross-build for inference. This is not yet ARM64 with CUDA for the Jetson, but I think that if you configure for CUDA with this cross-compiling setup, it might work.

Check issue #1305

gheorghelisca commented 6 years ago

Hi,

I have the following configuration:

I am trying to build the native_client for Python, and when I run the first command:

bazel build -c opt --copt=-O3 --copt="-D_GLIBCXX_USE_CXX11_ABI=0" //native_client:libctc_decoder_with_kenlm.so

from the README, I get this error:

ERROR: /home/nvidia/data_1/bazel_cache/_bazel_nvidia/4b572e4626de8216768980dc63f03b80/external/jpeg/BUILD:269:1: C++ compilation of rule '@jpeg//:simd_armv7a' failed (Exit 1)
gcc: error: unrecognized command line option '-mfloat-abi=softfp'
Target //native_client:libctc_decoder_with_kenlm.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 95.915s, Critical Path: 0.96s
INFO: 10 processes: 10 local.
FAILED: Build did NOT complete successfully

I find it weird that gcc is complaining about armv7a when in fact the TX2 is ARMv8-A (if I am not mistaken).

Could somebody please give me some ideas how to proceed further?

lissyx commented 6 years ago

@gheorghelisca This is not the right place to ask for help. The error message actually confirms you are building on ARMv8 with an AArch64 compiler, hence the nonexistent float-ABI command-line argument. I cannot help more; we don't support in-situ ARM building. Please cross-compile.

lissyx commented 6 years ago

@gheorghelisca You should discuss this on the Discourse forum, specifically with @elpimous; he actually runs this setup, and I know he recently updated his work to current master without any issue.

gheorghelisca commented 6 years ago

@lissyx Thanks a bunch! ... I will get in touch with @elpimous .

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.