mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

GPU inference takes too long: about 2000s for one audio file #2367

Closed ljhzxc closed 4 years ago

ljhzxc commented 5 years ago

Train

Here are my training parameters:

python3 -u DeepSpeech.py \
  --train_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_train.csv \
  --dev_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_dev.csv \
  --test_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_test_10.csv \
  --train_batch_size 64 \
  --dev_batch_size 32 \
  --test_batch_size 2 \
  --learning_rate 0.00005 \
  --n_hidden 1024 \
  --es_steps 6 \
  --epochs 700 \
  --alphabet_config_path /home/lujiahui/DeepSpeech/data/new_alphabet.txt \
  --checkpoint_dir {...} \
  --export_dir {...} \
  --beam_width 128 \
  "$@"

The final training loss is about 6.3 and the cross-validation loss is about 14.5.

Inference

I got the deepspeech executable via util/taskcluster.py:

python3 util/taskcluster.py --arch gpu --target ~/DeepSpeech/deepspeech_binary/gpu

I use the model exported to the {export_dir} above for inference.

My inference command is:

~/DeepSpeech/deepspeech_binary/gpu/deepspeech \
  --model ~/ds_result/aishell2/export/export_1024_64_32_2/output_graph.pb \
  --alphabet ~/DeepSpeech/data/new_alphabet.txt \
  --audio ~/AISHELL-2/iOS/data/wav/C9329/IC9329W0446.wav \
  -t 2>&1

Inference Log

TensorFlow: v1.14.0-14-g1aad02a
DeepSpeech: v0.6.0-alpha.5-50-g90c2acd
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-09-16 14:18:03.539524: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-09-16 14:18:03.541559: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-16 14:18:06.427730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38 pciBusID: 0000:18:00.0
2019-09-16 14:18:06.428760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38 pciBusID: 0000:86:00.0
2019-09-16 14:18:06.428772: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-09-16 14:18:06.432569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-09-16 14:18:06.961864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-16 14:18:06.961910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2019-09-16 14:18:06.961920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2019-09-16 14:18:06.961929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2019-09-16 14:18:06.966806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30428 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
2019-09-16 14:18:06.971983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30428 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0)
曹 天 蕉 菜 谱 有 什 么
cpu_time_overall=2014.25946
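For what it's worth, a small wrapper like the sketch below makes it easier to compare wall-clock time across runs when experimenting with decoder settings. It only reuses the binary, model, alphabet, and audio paths from the commands above; the loop and file list are just illustrative Python, not part of DeepSpeech's tooling.

import os
import subprocess
import time

# Paths taken from the commands above; expanduser resolves the "~".
BIN = os.path.expanduser("~/DeepSpeech/deepspeech_binary/gpu/deepspeech")
MODEL = os.path.expanduser("~/ds_result/aishell2/export/export_1024_64_32_2/output_graph.pb")
ALPHABET = os.path.expanduser("~/DeepSpeech/data/new_alphabet.txt")
WAVS = [os.path.expanduser("~/AISHELL-2/iOS/data/wav/C9329/IC9329W0446.wav")]

for wav in WAVS:
    start = time.time()
    result = subprocess.run(
        [BIN, "--model", MODEL, "--alphabet", ALPHABET, "--audio", wav, "-t"],
        capture_output=True, text=True)
    print(f"{wav}: {time.time() - start:.1f}s wall clock")
    print(result.stdout.strip())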

lissyx commented 5 years ago

This is already reported; the decoder takes longer than we would like on large-alphabet languages.

ljhzxc commented 5 years ago

I asked someone who trained a Chinese model several months ago. He said he ran into the same problem, and his solution was to reduce beam_width to < 50. But no matter how I change beam_width (default 1024 -> 128 -> 10 -> 3), the inference time is still about 2000s. Each time I change beam_width, I comment out --train_files and --dev_files in the run script and rerun it. Then I take the new model from the {export dir} and use it for inference. Is this the right way to get the new model?

lissyx commented 5 years ago

I asked someone who trained a Chinese model several months ago. He said he ran into the same problem, and his solution was to reduce beam_width to < 50.

Do you have links?

But no matter how I change beam_width (default 1024 -> 128 -> 10 -> 3), the inference time is still about 2000s.

Are you sure you are doing it right?

Then I take the new model from the {export dir} and use it for inference. Is this the right way to get the new model?

I'm not sure I understand your question here.

bernardohenz commented 5 years ago

One way of reducing the inference time on large alphabets is to set a lower cutoff_top_n (https://github.com/mozilla/DeepSpeech/blob/master/native_client/deepspeech.cc#L340). Of course, you will have to recompile the decoder if you make this change.
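To illustrate what that knob does, here is a minimal NumPy sketch of the pruning idea (an illustration only, not the native decoder's code): at each frame the decoder only expands beams with the cutoff_top_n most probable symbols, optionally truncated further once their cumulative probability reaches cutoff_prob, which caps the per-frame work on a several-thousand-character Mandarin alphabet.

import numpy as np

def prune_frame(probs, cutoff_top_n=40, cutoff_prob=1.0):
    """Symbols a beam-search step would still consider for one frame after
    top-n and cumulative-probability pruning (illustrative, not the real decoder)."""
    order = np.argsort(probs)[::-1][:cutoff_top_n]   # keep the cutoff_top_n most probable symbols
    if cutoff_prob < 1.0:                            # optionally stop once cutoff_prob mass is covered
        cum = np.cumsum(probs[order])
        order = order[:np.searchsorted(cum, cutoff_prob) + 1]
    return order

# A deliberately peaked frame over a hypothetical ~5000-symbol Mandarin alphabet,
# the way a trained acoustic model typically looks (mostly blank plus a few characters).
frame = np.full(5001, 1e-6)
frame[0], frame[1], frame[2] = 0.9, 0.08, 0.01
frame /= frame.sum()

print(len(prune_frame(frame, cutoff_top_n=5001)))                  # 5001: no pruning, the whole alphabet
print(len(prune_frame(frame, cutoff_top_n=40)))                    # 40: at most cutoff_top_n candidates
print(len(prune_frame(frame, cutoff_top_n=40, cutoff_prob=0.99)))  # 3: 99% of the mass sits in three symbols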

ljhzxc commented 5 years ago

I asked someone who trained a Chinese model several months ago. He said he ran into the same problem, and his solution was to reduce beam_width to < 50.

Do you have links?

But no matter how I change beam_width (default 1024 -> 128 -> 10 -> 3), the inference time is still about 2000s.

Are you sure you are doing it right?

Then I take the new model from the {export dir} and use it for inference. Is this the right way to get the new model?

I'm not sure I understand your question here.

Sorry, I do not have that link. I'm not sure I am doing it right, so I want to describe how I do it. {Then I take the new model from the {export dir} and use it for inference. Is this the right way to get the new model?}

Maybe I didn't describe it clearly.

I mean, I want to change the beam width but I'm not sure where in the code I should change it. I only found it in one place (the --beam_width parameter of the training run listed in the #Train section of my issue). So I change this beam_width parameter and do not train; I just run the script (compared to the last run: the two lines --train_files and --dev_files are commented out, and --beam_width is changed from 128 to 10):

python3 -u DeepSpeech.py \
  --test_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_test_10.csv \
  --train_batch_size 64 \
  --dev_batch_size 32 \
  --test_batch_size 2 \
  --learning_rate 0.00005 \
  --n_hidden 1024 \
  --es_steps 6 \
  --epochs 700 \
  --alphabet_config_path /home/lujiahui/DeepSpeech/data/new_alphabet.txt \
  --checkpoint_dir {...} \
  --export_dir {...} \
  --beam_width 10 \
  "$@"

Then I get a new model in the export_dir. I want to know: does this new model use the new beam_width I just changed?

ljhzxc commented 5 years ago

I noticed a strange phenomenon. When I use the GPU build of deepspeech to infer (on just one Tesla V100), almost 32 GB of GPU memory is used, but GPU utilization is 0%. Is it a CUDA/cuDNN compatibility issue?

lissyx commented 5 years ago

So I change this beam_width parameter and do not train; I just run the script (compared to the last run: the two lines --train_files and --dev_files are commented out, and --beam_width is changed)

I'm sorry, but I don't understand what you are saying. Do you change beam_width at export time? That's not enough, as @bernardohenz mentioned above.

I noticed a strange phenomenon. When I use the GPU build of deepspeech to infer (on just one Tesla V100), almost 32 GB of GPU memory is used, but GPU utilization is 0%. Is it a CUDA/cuDNN compatibility issue?

No, a CUDA/cuDNN compatibility issue would just mean it doesn't work at all. Decoding is not done on the GPU; you see 32 GB used because TensorFlow allocates everything it can by default, and it sits at 0% utilization because ... decoding is done on the CPU.

C5YS commented 5 years ago

Maybe this topic shouldn't go here, but regarding VRAM use: is there any way not to allocate everything when inferring? Modifying the code with "config.gpu_options.allow_growth = True" works, but only for training the model.
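For context, the TensorFlow 1.x option being referred to looks like this in plain TF 1.14 (a sketch of the generic session setup, not of DeepSpeech's own code):

import tensorflow as tf  # TensorFlow 1.x API, matching the v1.14 build in the log above

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of grabbing it all
# config.gpu_options.per_process_gpu_memory_fraction = 0.3  # alternatively, cap the fraction used

with tf.Session(config=config) as sess:
    # build or load a graph here and run it; the session now grows its GPU
    # allocation as needed rather than reserving the full card up front
    pass

Since this is a Python-side session option, it only applies to the training/evaluation scripts, which matches the observation that it helps for training but not for the native deepspeech binary.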

lissyx commented 5 years ago

Maybe this topic shouldn't go here, but regarding VRAM use: is there any way not to allocate everything when inferring? Modifying the code with "config.gpu_options.allow_growth = True" works, but only for training the model.

That's not really the problem here ...

ljhzxc commented 5 years ago

Thank you, @bernardohenz. I changed the beam width and cutoff_top_n in deepspeech.cc. Reducing the beam_width seems to work: when I change beam_width from 500 (the default) to 16, the decode time drops from 2000+s to 54s (the source audio is about 3s), and with beam_width 4 the decode time is 15s. But reducing cutoff_top_n seems to have no effect. How can I get a smaller decode time other than by reducing beam_width?
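As an aside, the timings reported above already scale roughly linearly with beam width, which is what you would expect if beam expansion dominates decode time; a quick back-of-envelope check on those same numbers (the 2000 s figure is approximate):

# Decode times reported above for a ~3 s clip, keyed by beam width.
timings = {500: 2000, 16: 54, 4: 15}  # beam_width -> seconds (approximate)

for beam, secs in sorted(timings.items()):
    print(f"beam_width={beam:>3}: {secs:>4}s  (~{secs / beam:.1f}s per beam)")
# Every row comes out around 3.4-4 s per beam, i.e. decode time here grows
# about linearly with beam_width, so halving the beam roughly halves the time.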

reuben commented 4 years ago

This is due to the large alphabet you're using. The UTF-8 mode recently introduced should be able to handle this. For Mandarin I recommend using cutoff_prob=0.99 as well to speed up decoding. In the near future I'll write some docs for UTF-8 based training.
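For intuition on why UTF-8 mode helps here (a rough illustration of the idea, not the DeepSpeech implementation): the model predicts raw UTF-8 bytes instead of whole characters, so the output alphabet never exceeds 256 symbols regardless of how many characters the language uses. For example, using the transcript from the log above:

# The transcript from the inference log, re-expressed as UTF-8 byte labels.
text = "曹天蕉菜谱有什么"
byte_labels = list(text.encode("utf-8"))

print(len(text), "characters ->", len(byte_labels), "byte labels")  # 8 characters -> 24 byte labels
print(byte_labels[:6])  # each label is an integer in 0..255
# A character alphabet for Mandarin needs thousands of entries; a byte
# alphabet needs at most 256, which keeps the decoder's per-step work small.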

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.