k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

100% GPU 'freeze' with Zipformer #416

Open gabor-pinter opened 1 year ago

gabor-pinter commented 1 year ago

We are having an issue using Zipformer with multiple worker threads; it looks like a livelock/busy-deadlock situation:

Further notes:

I am attaching a trace log captured during a deadlock. The node I am using has 8 virtual CPUs; Sherpa uses 11 threads and seems to be active on 4 of them:

#1  sherpa::OnlineZipformerTransducerModel::GetEncoderInitStates(...)
#11 sherpa::OnlineRecognizer::OnlineRecognizerImpl::DecodeStreams(...)
#15 sherpa::OnlineTransducerGreedySearchDecoder::Decode(...)
#18 sherpa::OnlineZipformerTransducerModel::RunEncoder(...)
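For reference, a per-thread backtrace like the one above can be captured from the running server with gdb; a minimal sketch (the process name lookup is just one way to find the PID):

# find the server process and dump a backtrace of every thread
pid=$(pidof sherpa-online-websocket-server)
gdb -p "$pid" -batch -ex "set pagination off" -ex "thread apply all bt" > sherpa_threads.txt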

Environment:

Nvidia driver version: 510.108.03
CUDA runtime version: 11.8.89
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
Is debug build: False
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CMake version: version 3.24.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)

Versions of relevant libraries:
[pip3] k2==1.23.3.dev20230105+cuda11.7.torch1.13.1
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.13.1
[pip3] torch-tensorrt==1.3.0a0
[pip3] torchaudio==2.0.2
[pip3] torchtext==0.13.0a0+fae8e8c
[pip3] torchvision==0.15.0a0
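For reference, an environment summary in this format can be regenerated at any time with PyTorch's built-in helper:

# prints OS, CUDA, driver, and relevant pip/conda package versions
python3 -m torch.utils.collect_env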

Does it look like a race/sync issue?
Hopefully I will be able to post results from NVIDIA's compute-sanitizer.
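For anyone trying the same thing, a typical invocation would look roughly like this (a sketch; the server launch command is a placeholder, and memcheck is only one of the available tools):

# wrap the workload in compute-sanitizer; racecheck/synccheck/initcheck are the other tools
compute-sanitizer --tool memcheck --log-file sanitizer.log <sherpa server launch command>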

gabor-pinter commented 1 year ago

Answering myself: no exhaustive testing has been done yet, but when the CUDA runtime and PyTorch's CUDA versions match (both 11.7, by using nvcr.io/nvidia/pytorch:22.08-py3 as the base image), the problem seems to disappear.
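A minimal sketch of that setup, assuming Docker with the NVIDIA container toolkit is available:

# the 22.08 NGC image ships PyTorch built against CUDA 11.7, so runtime and build versions agree
docker pull nvcr.io/nvidia/pytorch:22.08-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.08-py3 bash
# ... then build/install k2 and sherpa inside this container ...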

danpovey commented 1 year ago

If it happens frequently enough, it may be possible to find out which kernel was running when it crashed by doing something like nsys profile python3 ... and then looking at the resulting .qdrep file with NVIDIA Nsight Systems. That could require a "debug" version of PyTorch, though.
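A concrete invocation might look like this (a sketch; the output name and trace selection are just reasonable defaults, and the launch script is a placeholder):

# profile the run under Nsight Systems; open the resulting .qdrep/.nsys-rep file in the GUI
nsys profile --trace=cuda,nvtx,osrt -o sherpa_freeze python3 <your_launch_script.py>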

csukuangfj commented 1 year ago

The node I am using has 8 virtual CPUs; Sherpa uses 11 threads and seems to be active on 4 of them:

By the way, could you post the complete commands you are using? Also, did you change any code?

gabor-pinter commented 1 year ago

Hi Dan, thanks for the comment on the profiler. Though I have only used it on the "fixed" setup so far, the nsys output is really informative, thanks for mentioning it. For the reader: a human-readable report can be generated with nsys stats report3.nsys-rep (where report3.nsys-rep is the nsys dump):
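For example (a sketch; report names differ slightly across nsys versions):

# overall summaries from the captured profile
nsys stats report3.nsys-rep
# or a single report, e.g. the per-kernel GPU time summary
nsys stats --report cuda_gpu_kern_sum report3.nsys-rep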

That could require a "debug" version of PyTorch, though.

Do you mean a static build of torch with debug symbols? I am slowly developing an itch to build torch in-house, and there will probably be a point where we cannot avoid it.

gabor-pinter commented 1 year ago

Hi Csukuangfj,

Also, did you change any code?

Yes, we made some changes, but mainly around logging.

could you post the complete commands you are using?

Sure, let me go back to a version where I can reproduce the issue, and I will post the command (hopefully with some insights from nsys).

danpovey commented 1 year ago

Torch has some kind of debug build option, I think; I don't know whether they distribute such builds via pip, etc.
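For reference, a source build with debug information is driven by environment variables; a rough sketch (dependency installation elided, and the exact flags may vary by PyTorch version):

# build PyTorch from source with debug information
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# DEBUG=1 produces a full debug build; REL_WITH_DEB_INFO=1 keeps optimizations but adds symbols
REL_WITH_DEB_INFO=1 python3 setup.py develop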

gabor-pinter commented 1 year ago

When it comes to debug builds, I believe there are too many flags/options to consider for a release version.

gabor-pinter commented 1 year ago

An update: I tested the crash-y version in 3 conditions:

[1] running the binary directly

[2] nsys run

[3] compute-sanitizer run

@csukuangfj, here is the command to start the server:

/workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
     --port=7014 \
     --nn-model=${MDL_DIR}/cpu_jit.pt \
     --tokens=${MDL_DIR}/tokens.txt \
     --doc-root=$WEB_INDEX \
     --use-gpu=true \
     --sample-frequency=8000 \
     --num-work-threads=10 \
     --max-batch-size=400 \
     --decode-chunk-size=64
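For reference, the "stuck at 100%" state can be watched from another shell while the server is handling streams; a minimal sketch:

# a wedged GPU shows sustained 100% utilization with no progress in the server logs
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv -l 1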

csukuangfj commented 1 year ago

RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.7.
  Make sure that libnvrtc-builtins.so.11.7 is installed correctly.

The error shows that it cannot find the following file:

/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7

Could you set

export LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH

and see if this error goes away.
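A quick way to confirm the loader can now resolve it (run in the same shell after the export):

# dlopen the library by soname; it should load without raising OSError
python3 -c "import ctypes; ctypes.CDLL('libnvrtc-builtins.so.11.7'); print('found')"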

gabor-pinter commented 1 year ago

Hi @csukuangfj, thanks for the hint. The modification of LD_LIBRARY_PATH worked. However, after 2 runs the server crashed. Since a few of us are using the same server, (1) I am not absolutely sure whether this modification has anything to do with the crash, and (2) I will have to find a calm period when I can try again.

One thing I noticed, though, is that the python/dist-packages path preceded CUDA's compat library. My guess is that the compat lib is supposed to come left-most in LD_LIBRARY_PATH.
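A sketch of that ordering, assuming the compat libraries live under /usr/local/cuda/compat as in many NGC images:

# keep the driver compat libs ahead of the pip-installed nvrtc path
export LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH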

danpovey commented 1 year ago

If the server just rebooted without anything in the logs, it's likely that the power supply was not sufficient and it tripped due to the GPUs being used too much.
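One way to check for that kind of event after the fact (a sketch; exact log messages vary by driver version):

# kernel log entries from the NVIDIA driver (Xid errors, resets) around the crash time
sudo dmesg -T | grep -iE "xid|nvrm"
# and whether the GPUs were running close to their power limit
nvidia-smi --query-gpu=power.draw,power.limit,temperature.gpu --format=csv -l 1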