Faster Cuda Decoder - Githubissues

galv commented 1 year ago

There were several issues recently discovered with the cuda decoder in both offline and online mode.

After my fixes, I can achieve 7800 RTFx throughput on librispeech test-clean and the model https://kaldi-asr.org/models/m13 with an A100-80GB PCIe card in the offline mode of computation. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which isn't bad, admittedly.

Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:

This was achieved via the following hyperparameter sweep:

for chunk_size in 21 30 40 50; do
    for num_streaming_channels in 1000 2000 3000 4000 5000 6000; do
        max_batch_size=$((num_streaming_channels>4000 ? 4000 : num_streaming_channels))
        /home/dgalvez/scratch/code/asr/kaldi-a100-perf//src/cudadecoderbin/batched-wav-nnet3-cuda-online --num-channels=$((num_streaming_channels * 2)) --cuda-use-tensor-cores=true --main-q-capacity=30\
000 --aux-q-capacity=400000 --cuda-memory-proportion=0.5 --max-batch-size=$max_batch_size --cuda-worker-threads=12 --file-limit=-1 --cuda-decoder-copy-threads=4 --batching-copy-threads=8 --frame-subsam\
pling-factor=3 --frames-per-chunk=$chunk_size --max-mem=100000000 --beam=10 --lattice-beam=7 --acoustic-scale=1.0 --determinize-lattice=true --max-active=10000 --iterations=10 --file-limit=-1 --config=\
/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//conf/online.conf --num-parallel-streaming-channels=$num_streaming_channels --word-symbol-table=/home/dgalvez/scratch/code/a\
sr/kaldi-a100-perf/workspace//models/LibriSpeech//words.txt /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//final.mdl /home/dgalvez/scratch/code/asr/kaldi-a100-perf/worksp\
ace//models/LibriSpeech//HCLG.fst scp:/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//datasets/LibriSpeech/test_clean//wav_conv.scp 'ark:|gzip -c > /tmp/results/LibriSpeech/52/0/lat.gz' # 2> \
output.log                                                                                                                                                                                                
        cat output.log | grep -A 1 "Latencies" | grep -v "Latencies" | awk 'BEGIN { OFS = ","; ORS = ""} {print $3,$4,$5,$6}' >> $result_file
        echo ",${chunk_size},${num_streaming_channels},${max_batch_size}" >> $result_file
    done
done

Do note that better results can be achieved sometimes by setting maximum batch size lower than the number of channels. Average latency is, of course, much smaller. This means users can do real-time decoding at 3000-4000 audio streams concurrently.

This is the "compute" latency. It doesn't include the time spent waiting for the right hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.

galv commented 1 year ago

FYI, CI is faling with:

extras/check_dependencies.sh: python2.7 is not installed
extras/check_dependencies.sh: Some prerequisites are missing; install them using the command:
  sudo apt-get install python2.7
make: *** [Makefile:39: check_required_programs] Error 1

galv commented 1 year ago

Fixed CI (so far)

galv commented 1 year ago

FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

ravi-shanker-m commented 1 year ago

@galv thread-pool-light.h is deleted from the latest commit, but batched-threaded-nnet3-cuda-pipeline.h still calls it.

galv commented 1 year ago

@ravi-shanker-m that's been deprecated for a few years now: https://github.com/kaldi-asr/kaldi/blob/be22248e3a166d9ec52c78dac945f471e7c3a8aa/src/cudadecoder/batched-threaded-nnet3-cuda-pipeline.h#L35-L36

I'm happy to go ahead and remove that code.

Were you using it for some reason?

trunglebka commented 1 year ago

For this question, I've tried both cuda decoder v1 vs v2 and v1 give better RTF in my case so my old service using this implementation. Maybe nvidia provided kaldi docker with parameter optimized for their computing resource and I have not tried enough

galv commented 1 year ago

@trunglebka, I'm happy to provide advice if you give more detail. I would sincerely doubt that you reach anywhere near 8000 RTFx on the v1 cuda decoder on an A100 (or whatever GPU you are using).

The nvidia kaldi container is not anything special. It's just a pre-built kaldi from open source with some CI to make sure that nothing has broken. You can reproduce my work by running the librispeech model I linked in the first comment on librispeech test-clean, using the command line flags I specify.

trunglebka commented 1 year ago

I've retired from my old company. In my case, after some experiment of tuning parameters using nvidia kaldi docker with T4, v1 give me about 500 RTFx but v2 just about 350 RTFx. Due to deadline I do not have enough time to experiment more so I just pick V1. So I think it maybe the problem with choosing parameters.

galv commented 1 year ago

@trunglebka Okay. I found several performance problems with the v2 decoder during my work on making this PR and this is very close to the "speed of light", so I'm not concerned about the v1 decoder being any better than this one.

trunglebka commented 1 year ago

Yeah, just want to provide you context where v1 being used.

danpovey commented 1 year ago

FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

Yes, that's cool! Thanks!

jtrmal commented 1 year ago

@galv plz merge once you feel it's complete

stale[bot] commented 1 year ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

jtrmal commented 1 year ago

@galv good to merge?

zulkarneev commented 1 year ago

Hi, could you tell were ivectors used for 7800 RTFx? Config file for ivectors is not passed in the script above. And for chunk size = 30 batched-wav-nnet3-cuda-online gives Assertion failed: ("Please set --frames-per-chunk at least as large as the neural net " "right context" && input_frames_perchunk >= total_nnet_rightcontext)

danpovey commented 1 year ago

@zulkarneev does the same issue happen with the previous version of the decoder?

zulkarneev commented 1 year ago

Dan, what version do you mean?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

kaldi-asr / kaldi

Faster Cuda Decoder #4811