ljhzxc closed this issue 4 years ago
This is already reported: the decoder takes longer than we would like on large-alphabet languages.
I have asked someone who trained the Chinese model several months ago. He said he also met the same problem, and his solution was reducing the beam_width to < 50. But no matter how I change the beam_width (default 1024 -> 128 -> 10 -> 3), the inference time is still about 2000s. Every time I change beam_width, I comment out --train_files and --dev_files in the run script and rerun it. Then I get the new model in the {export dir} and use it to infer. Is this the right way to get the new model?
I have asked someone who trained the Chinese model several months ago. He said he also met the same problem, and his solution was reducing the beam_width to < 50.
Do you have links?
But no matter how I change the beam_width (default 1024 -> 128 -> 10 -> 3), the inference time is still about 2000s.
Are you sure you are doing it right?
Then I get the new model in the {export dir} and use it to infer. Is this the right way to get the new model?
I'm not sure I understand your question here.
One way of reducing the inference time on large alphabets is to set a lower cutoff_top_n (https://github.com/mozilla/DeepSpeech/blob/master/native_client/deepspeech.cc#L340). Of course, you will have to recompile the decoder if you make this change.
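To illustrate the idea, here is a minimal NumPy sketch of the concept only, not the actual decoder code in native_client: cutoff_top_n prunes each time step's character distribution to the N most probable symbols before the beam search expands candidates, which is why it matters so much on an alphabet with thousands of characters.

import numpy as np

def prune_top_n(frame_probs, cutoff_top_n=40):
    """Keep only the cutoff_top_n most probable characters of one frame.

    The beam search then only considers these survivors when extending beams,
    instead of every character in the alphabet.
    """
    top = np.argsort(frame_probs)[::-1][:cutoff_top_n]
    return top, frame_probs[top]

# Toy example: a 5000-character "alphabet" (roughly Mandarin-sized).
rng = np.random.default_rng(0)
frame = rng.random(5000)
frame /= frame.sum()

idx, kept = prune_top_n(frame, cutoff_top_n=40)
print(len(idx))       # 40 candidates per frame instead of 5000
print(kept.sum())     # probability mass actually kept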
Sorry, I do not have a link. I'm not sure that I'm doing it right, so let me describe what I do. (Then I get the new model in the {export dir} and use it to infer. Is this the right way to get the new model?)
Maybe I didn't describe it clearly.
I mean, I want to change the beam width, but I'm not sure where I should change the code. I only found it in one place (the --beam_width parameter of the training run listed in the #Train section of my problem). So I change this beam_width parameter and do not train; I just run the script (compared to the last run script: comment out the two lines --train_files and --dev_files, and change --beam_width from 128 to 10):
python3 -u DeepSpeech.py \
  --test_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_test_10.csv \
  --train_batch_size 64 \
  --dev_batch_size 32 \
  --test_batch_size 2 \
  --learning_rate 0.00005 \
  --n_hidden 1024 \
  --es_steps 6 \
  --epochs 700 \
  --alphabet_config_path /home/lujiahui/DeepSpeech/data/new_alphabet.txt \
  --checkpoint_dir {...} \
  --export_dir {...} \
  --beam_width 10 \
  "$@"
Then I get a new model in the export_dir. I want to know: does this new model use the new beam_width that I just changed?
I also find a strange phenomenon. When I use the gpu-arch deepspeech binary to infer (using just one Tesla V100), almost 32 GB of GPU memory is used, but GPU utilization is 0%. Is it a CUDA/cuDNN compatibility issue?
So I change this beam_width parameter and do not train; I just run the script (compared to the last run script: comment out the two lines --train_files and --dev_files, and change --beam_width).
I'm sorry, but I don't understand what you are saying. Do you change beam_width at export time? That's not enough, as @bernardohenz mentioned above.
I also find a strange phenomenon. When I use the gpu-arch deepspeech binary to infer (using just one Tesla V100), almost 32 GB of GPU memory is used, but GPU utilization is 0%. Is it a CUDA/cuDNN compatibility issue?
No, a CUDA/cuDNN compatibility issue would just mean it's not working at all. Decoding is not done on the GPU, and you have 32 GB used because TensorFlow allocates everything it can, by default. And it's at 0% utilization because ... decoding is done on the CPU.
Maybe this topic shouldn't go here, but: on the use of VRAM, is there any way not to allocate everything when inferring? I made a modification to the code (config.gpu_options.allow_growth = True) and it works, but only for training the model.
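For reference, the modification mentioned above is the standard TF 1.x session configuration; here is a minimal sketch, assuming you control the tf.Session yourself (e.g. when running training or evaluation through DeepSpeech.py), since the prebuilt deepspeech binary sets up its own session internally.

import tensorflow as tf

# TF 1.x session options: grow GPU memory on demand instead of reserving
# everything up front (the default behaviour described above).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Or cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.3

with tf.Session(config=config) as sess:
    pass  # build the graph / run inference here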
That's not really the problem here ...
Thank you, @bernardohenz. I changed the width and cutoff_top_n in deepspeech.cc. Reducing the beam_width seems to work: when I change the beam_width from 500 (default) to 16, the decode time drops from 2000+s to 54s (the source audio is about 3s). If beam_width is 4, the decode time is 15s. But reducing cutoff_top_n seems to have no effect. How can I get a smaller decode time other than reducing beam_width?
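That timing pattern is roughly what one would expect: the decoder's work per frame grows with the number of active beams. Below is a deliberately simplified toy beam search, not the real ds_ctcdecoder (which also handles CTC blanks, prefix merging, and language model scoring); it only shows how decode time scales with beam width on a large alphabet, so the absolute numbers mean nothing, only the trend does.

import time
import numpy as np

def toy_beam_search(log_probs, beam_width):
    """Toy beam search: keep only the beam_width best prefixes after each frame."""
    beams = [((), 0.0)]                               # (prefix, log-score)
    for frame in log_probs:                           # one step per time frame
        candidates = {}
        for prefix, score in beams:
            for char_id, lp in enumerate(frame):      # expand by every character
                candidates[prefix + (char_id,)] = score + lp
        beams = sorted(candidates.items(), key=lambda kv: kv[1],
                       reverse=True)[:beam_width]
    return beams[0]

rng = np.random.default_rng(0)
# 30 frames over a 2000-character alphabet, normalized to probabilities.
log_probs = np.log(rng.dirichlet(np.ones(2000), size=30))

for bw in (4, 16, 64):
    start = time.time()
    toy_beam_search(log_probs, bw)
    print(f"beam_width={bw:3d}  {time.time() - start:.2f}s")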
This is due to the large alphabet you're using. The UTF-8 mode recently introduced should be able to handle this. For Mandarin I recommend using cutoff_prob=0.99 as well to speed up decoding. In the near future I'll write some docs for UTF-8 based training.
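A rough sketch of what cutoff_prob does (again just an illustration, not the real decoder code): per frame, keep only the smallest set of characters whose cumulative probability reaches cutoff_prob, capped at cutoff_top_n; with a peaked softmax over a Mandarin-sized alphabet that usually leaves only a handful of candidates. The last lines also show the idea behind the UTF-8 mode: the network predicts raw UTF-8 bytes, so the output alphabet is at most 256 symbols regardless of how many distinct characters the language has.

import numpy as np

def prune_cutoff_prob(frame_probs, cutoff_prob=0.99, cutoff_top_n=40):
    """Keep the smallest top set of characters whose mass reaches cutoff_prob."""
    order = np.argsort(frame_probs)[::-1]
    cumulative = np.cumsum(frame_probs[order])
    keep = int(np.searchsorted(cumulative, cutoff_prob)) + 1
    return order[:min(keep, cutoff_top_n)]

# A peaked distribution over a 5000-character alphabet, typical of a
# reasonably converged acoustic model.
probs = np.full(5000, 1e-6)
probs[:5] = [0.6, 0.25, 0.1, 0.03, 0.01]
probs /= probs.sum()
print(len(prune_cutoff_prob(probs)))        # only a few candidates survive

# UTF-8 mode: label the transcript as raw bytes (alphabet size <= 256)
# instead of one label per Chinese character.
print(list("菜谱有什么".encode("utf-8")))   # 15 byte labels drawn from 0-255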
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Train
Here are my training parameters:
python3 -u DeepSpeech.py \
  --train_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_train.csv \
  --dev_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_dev.csv \
  --test_files /home/lujiahui/DeepSpeech/data/aishell2/csv/aishell2_test_10.csv \
  --train_batch_size 64 \
  --dev_batch_size 32 \
  --test_batch_size 2 \
  --learning_rate 0.00005 \
  --n_hidden 1024 \
  --es_steps 6 \
  --epochs 700 \
  --alphabet_config_path /home/lujiahui/DeepSpeech/data/new_alphabet.txt \
  --checkpoint_dir {...} \
  --export_dir {...} \
  --beam_width 128 \
  "$@"
The final training loss is about 6.3 and the cross-validation loss is about 14.5.
Inference
I get the deepspeech executable via util/taskcluster.py:
python3 util/taskcluster.py --arch gpu --target ~/DeepSpeech/deepspeech_binary/gpu
I use the output model in the above {export_dir} to do the inference.
My inference command is:
~/DeepSpeech/deepspeech_binary/gpu/deepspeech \
  --model ~/ds_result/aishell2/export/export_1024_64_32_2/output_graph.pb \
  --alphabet ~/DeepSpeech/data/new_alphabet.txt \
  --audio ~/AISHELL-2/iOS/data/wav/C9329/IC9329W0446.wav \
  -t 2>&1
Inference Log
TensorFlow: v1.14.0-14-g1aad02a
DeepSpeech: v0.6.0-alpha.5-50-g90c2acd
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-09-16 14:18:03.539524: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-09-16 14:18:03.541559: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-16 14:18:06.427730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38 pciBusID: 0000:18:00.0
2019-09-16 14:18:06.428760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38 pciBusID: 0000:86:00.0
2019-09-16 14:18:06.428772: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-09-16 14:18:06.432569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-09-16 14:18:06.961864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-16 14:18:06.961910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2019-09-16 14:18:06.961920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2019-09-16 14:18:06.961929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2019-09-16 14:18:06.966806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30428 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
2019-09-16 14:18:06.971983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30428 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0)
曹 天 蕉 菜 谱 有 什 么
cpu_time_overall=2014.25946