kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Memory "leak" of cudadecoder's arc instantiations #4814

Closed git-bruh closed 1 year ago

git-bruh commented 1 year ago

Hi, I have recently been trying to track down progressive memory growth in Triton's Kaldi backend (https://github.com/NVIDIA/DeepLearningExamples/issues/1240), and in pursuit of that I've managed to reproduce the issue with a bare Kaldi setup.

I don't have any understanding of Kaldi's internals, so some of the information given here might seem vague or outright nonsensical, but I hope it gives the general idea.

Basically, the issue seems to be that the cudadecoder keeps up to max_active arc instantiations for every computed audio chunk (frame computation), and these never seem to get freed until the decoder's destructor is called.

As far as I can tell, the arc instantiations are relevant only for a given correlation ID / audio stream, and there is no meaningful way to use them to improve the accuracy of other, unrelated audio streams / correlation IDs. So it seems fair to expect that all arc instantiations relating to a given correlation ID get freed once the last chunk for that ID has been processed. However, this doesn't seem to happen in practice.

This becomes a huge problem in the Triton Kaldi backend since it constantly takes in new inputs from clients, and the memory usage climbs rapidly with every inference (reaching up to 30G for large WAVs)

Steps to reproduce:

Use this shell script to launch an inference for the LibriSpeech dataset:

#!/bin/sh

# --max-active=10

/bin/time -v ./batched-wav-nnet3-cuda-online \
    --max-batch-size=1100 \
    --cuda-use-tensor-cores=true \
    --cuda-worker-threads=10 \
    --cuda-decoder-copy-threads=4 \
    --print-hypotheses \
    --main-q-capacity=30000 \
    --aux-q-capacity=400000 \
    --beam=10 \
    --num-channels=4000 \
    --lattice-beam=7 \
    --max-active=10000 \
    --frames-per-chunk=50 \
    --acoustic-scale=1.0 \
    --config=/data/models/LibriSpeech/conf/online.conf \
    --word-symbol-table=/data/models/LibriSpeech/words.txt \
    /data/models/LibriSpeech/final.mdl \
    /data/models/LibriSpeech/HCLG.fst \
    scp:/data/datasets/LibriSpeech/test_clean/wav_conv.scp \
    'ark:|gzip -c > /tmp/lat.gz'

Notice that the memory usage keeps climbing during the run and then stays at its peak after all the inferences have been performed. It only gets freed once the whole decoder object is destroyed. The expected behaviour would be for memory usage to fluctuate up and down as the arc instantiations of correlation IDs that have been fully decoded are released.
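
The RSS can be watched while the binary runs with something like the loop below (a rough sketch assuming a Linux /proc filesystem; the pgrep pattern is just an example):

#!/bin/sh
# Sample the decoder's resident set size once per second until it exits.
PID=$(pgrep -f batched-wav-nnet3-cuda-online)
while kill -0 "$PID" 2>/dev/null; do
    grep VmRSS "/proc/$PID/status"
    sleep 1
done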

The program's memory usage caps out at around 6 GB with max_active=10:

Command being timed: "./batched-wav-nnet3-cuda-online --max-batch-size=1100 ... **--max-active=10** ... ark:|gzip -c > /tmp/lat.gz"
        User time (seconds): 30.66
        ...
        Maximum resident set size (kbytes): **5989872**

I'm showing the maximum resident set size (i.e. the peak memory usage) because the usage never actually goes down after peaking, due to the leak. This can be confirmed by adding a sleep before the return here: https://github.com/kaldi-asr/kaldi/blob/master/src/cudadecoderbin/batched-wav-nnet3-cuda-online.cc#L316

And at around 8 GB with max_active=10000:

Command being timed: "./batched-wav-nnet3-cuda-online --max-batch-size=1100 ... **--max-active=10000** ... ark:|gzip -c > /tmp/lat.gz"             

        User time (seconds): 29.87
        ...
        Maximum resident set size (kbytes): **8204936**

This correlation between the memory usage and the value of max_active led me to believe that the arc instantiations are not being freed as soon as a given correlation ID's last chunk has been processed.

git-bruh commented 1 year ago

Related https://github.com/kaldi-asr/kaldi/issues/4723

galv commented 1 year ago

This becomes a huge problem in the Triton Kaldi backend since it constantly takes in new inputs from clients, and the memory usage climbs rapidly with every inference (reaching up to 30G for large WAVs)

Are you saying that you are trying to transcribe very long audio files (e.g., at least several minutes long) using the kaldi cuda decoder? Memory usage should grow as O(length of audio * max_active) in that case.
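
As a rough back-of-envelope illustration of that growth (the per-token size below is a pure assumption, for scale only, and the decoded frame rate depends on frame subsampling):

# ~3 minutes of audio, ~100 frames/sec, max_active=10000 tokens kept per frame,
# at an assumed ~16 bytes per stored token:
echo "$(( 180 * 100 * 10000 / 1024 * 16 / 1024 )) MiB"   # prints "2746 MiB", i.e. a few GiB per stream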

I've looked around, and there is no obvious place where I would see the memory growing with the number of correlation IDs, unless the kaldi triton decoder is accidentally always making new correlation IDs (instead of overwriting old ones).

git-bruh commented 1 year ago

Are you saying that you are trying to transcribe very long audio files (e.g., at least several minutes long) using the kaldi cuda decoder? Memory usage should grow as O(length of audio * max_active) in that case.

Yes, for example for testing I'm transcribing the same 3-minute file multiple times (sequentially, not in parallel), and each transcription uses up N GB of memory. The problem is that after the last chunk for that audio file has been sent and processed, that N GB is never freed.

So if I run the transcription for that 3-minute file M times, one by one, the memory usage grows to N * M GB, because that memory isn't freed until the program exits.

Also, as I said, this is reproducible with the batched-wav-nnet3-cuda-online program as well.

git-bruh commented 1 year ago

So to clarify a bit more, I'm going to paste the memory usage stats from the original issue:

1.mem: 6.9Gi
2.mem: 7.6Gi
3.mem: 8.3Gi
4.mem: 9.1Gi
5.mem: 9.8Gi
6.mem: 10Gi
7.mem: 11Gi
8.mem: 11Gi
9.mem: 12Gi
10.mem: 13Gi
11.mem: 14Gi
12.mem: 14Gi
13.mem: 15Gi
14.mem: 16Gi
15.mem: 16Gi
16.mem: 17Gi
17.mem: 18Gi
18.mem: 18Gi
19.mem: 19Gi
20.mem: 20Gi
21.mem: 20Gi
22.mem: 20Gi
23.mem: 20Gi
24.mem: 20Gi
25.mem: 20Gi
26.mem: 21Gi
27.mem: 21Gi
28.mem: 22Gi
29.mem: 22Gi
30.mem: 22Gi
31.mem: 22Gi
32.mem: 23Gi
33.mem: 24Gi
34.mem: 25Gi
35.mem: 25Gi
36.mem: 25Gi
37.mem: 26Gi
38.mem: 26Gi
39.mem: 27Gi
40.mem: 27Gi

Here, before the 1st run the server idles at ~5-6 GB of memory; after running an inference on that 3-minute file it rises to ~7 GB, so the usage is around 1-1.5 GB for a single inference of that WAV file.

If the behaviour were as expected, the memory would first peak at ~7 GB and then, after the inference is complete, drop back down to ~5 GB. The same would happen for M inferences, and the final memory usage would hover around the original mark, i.e. 5-7 GB.

However, this doesn't happen; on each request around 1.5 GB of memory gets leaked.

So with N = 1 (GB) and M = 40 (sequential inferences), the total would be ~40 GB of usage. (Here it is only a 21 GB rise over the ~6 GB idle, so perhaps some of the old memory does get cleared, but not all of it, since usage keeps growing progressively even though the rate of growth varies.)

galv commented 1 year ago

I am not 100% certain (I just tried to install the eBPF/bcc tools on this computer to use the "memleak" tool, but it's not working), but the likely culprit is that the host memory that grows with each frame is not cleared until a new channel is loaded into the slot where that old channel's data lives: https://github.com/kaldi-asr/kaldi/blob/be22248e3a166d9ec52c78dac945f471e7c3a8aa/src/cudadecoder/cuda-decoder.cc#L487-L493

Can we simply do this at the end of AdvanceDecoding() for "completed" frames? Not very easily. There are two ways to get output: CudaDecoder::GetBestPath and CudaDecoder::GetRawLattice. The current contract is that these can be called at any time for a given channel, up until you overwrite that channel with a new one via a new InitDecoding(). That is to say, there is no FinalizeDecoding() call that would invoke the appropriate deleters as you are requesting.

I am hesitant to break the existing contract, assuming this is the issue.

Would simply reducing the --num-channels option work for you? When you set it to 4000, all 4000 channels will be filled before you start reusing channels (and thus calling InitDecoding()): https://github.com/kaldi-asr/kaldi/blob/be22248e3a166d9ec52c78dac945f471e7c3a8aa/src/cudadecoderbin/batched-wav-nnet3-cuda-online.cc#L162. (I can't find the source code right now, but I'm fairly certain that Triton also implements correlation IDs by simply incrementing a uint64_t for each new incoming stream).
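
Concretely, that would mean re-running the reproducer from the first comment with a much smaller channel pool, so that channel slots are recycled (and InitDecoding() clears the old per-channel data) long before the run ends. A hedged sketch, identical to the command above except for --num-channels (1200 is an arbitrary value kept above --max-batch-size=1100):

/bin/time -v ./batched-wav-nnet3-cuda-online \
    --max-batch-size=1100 \
    --cuda-use-tensor-cores=true \
    --cuda-worker-threads=10 \
    --cuda-decoder-copy-threads=4 \
    --print-hypotheses \
    --main-q-capacity=30000 \
    --aux-q-capacity=400000 \
    --beam=10 \
    --num-channels=1200 \
    --lattice-beam=7 \
    --max-active=10000 \
    --frames-per-chunk=50 \
    --acoustic-scale=1.0 \
    --config=/data/models/LibriSpeech/conf/online.conf \
    --word-symbol-table=/data/models/LibriSpeech/words.txt \
    /data/models/LibriSpeech/final.mdl \
    /data/models/LibriSpeech/HCLG.fst \
    scp:/data/datasets/LibriSpeech/test_clean/wav_conv.scp \
    'ark:|gzip -c > /tmp/lat.gz'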

Also, it's worth adding that virtual memory may make your concerns less of a practical problem, since the data of an old stream is never used again anyway. It's possible the std::vector copy-assignment operator in InitDecoding() (when a stream does get "recycled") touches the old data, but since these are simple types with no associated destructors, I am doubtful.

Another point: I am hesitant to support your request given that you are apparently not using the cuda decoder in the expected way. We need to support thousands of audio streams for full performance, which means that your CPU server needs to be provisioned with an appropriate amount of memory anyway.

Finally, I would recommend decoding audio of at most 1 minute in length. As I mentioned, memory usage grows linearly with the length of a file. Splitting audio like this is common in speech recognition, and it prevents an easy denial-of-service attack if a user uploads a very long file.

Happy to work further with you on this if you can show it's a real concern.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

galv commented 1 year ago

I spent last week looking into this but unfortunately could not reproduce it. @git-bruh, if you are still around and able to provide a more exact reproducer, I could take another look. What I can say is that there are users using this code in production without the problems you describe.

I did happen to find some undefined behavior when running under ubsan and asan, for which I will open up a PR.

git-bruh commented 1 year ago

@galv Thanks a lot for your effort, but unfortunately it won't be possible for me to spend time on this issue in the near future, apologies.

galv commented 1 year ago

For anyone else who comes across this issue: the problem is most likely that this user was decoding audio files that were too long, causing a huge amount of memory to be used to store tokens for backtracking. If you are transcribing long audio files using batched-wav-nnet3-cuda2, check out the "--segment" option (and be sure to set an appropriate segment length, e.g., 15 seconds):

https://github.com/kaldi-asr/kaldi/blob/71f38e62cad01c3078555bfe78d0f3a527422d75/src/cudadecoderbin/batched-wav-nnet3-cuda2.cc#L88-L89
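
A hedged example of what that could look like (only the --segment flag itself is confirmed by the lines linked above; the config and data paths are carried over from the reproducer earlier in this thread and may need adjusting, and the exact name of the segment-length option should be checked via ./batched-wav-nnet3-cuda2 --help):

#!/bin/sh
# Offline batch decoding with segmentation enabled, so long files are split
# into shorter pieces instead of being decoded as one huge utterance.
./batched-wav-nnet3-cuda2 \
    --segment=true \
    --config=/data/models/LibriSpeech/conf/online.conf \
    /data/models/LibriSpeech/final.mdl \
    /data/models/LibriSpeech/HCLG.fst \
    scp:/data/datasets/LibriSpeech/test_clean/wav_conv.scp \
    'ark:|gzip -c > /tmp/lat.gz'
# Also set the segment length to something like 15 seconds, as advised above
# (check --help for that option's exact name).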