k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

100% GPU 'freeze' with Zipformer #416

Open gabor-pinter opened 1 year ago

gabor-pinter commented 1 year ago

We are having an issue using Zipformer with multiple worker threads; it looks like a livelock/busy-deadlock situation:

Further notes:

I am attaching a trace log captured during a deadlock. The node I am using has 8 virtual CPUs; Sherpa uses 11 threads and seems to be active on 4 of them:

#1  sherpa::OnlineZipformerTransducerModel::GetEncoderInitStates(...)
#11 sherpa::OnlineRecognizer::OnlineRecognizerImpl::DecodeStreams(...)
#15 sherpa::OnlineTransducerGreedySearchDecoder::Decode(...)
#18 sherpa::OnlineZipformerTransducerModel::RunEncoder(...)
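For reference, a per-thread backtrace like the one above can be captured from the running server with gdb; a minimal sketch (the process name lookup is just one way to find the PID):

# find the server process and dump a backtrace of every thread
pid=$(pidof sherpa-online-websocket-server)
gdb -p "$pid" -batch -ex "set pagination off" -ex "thread apply all bt" > sherpa_threads.txt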

Environment:

Nvidia driver version: 510.108.03
CUDA runtime version: 11.8.89
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
Is debug build: False
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CMake version: version 3.24.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)

Versions of relevant libraries:
[pip3] k2==1.23.3.dev20230105+cuda11.7.torch1.13.1
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.13.1
[pip3] torch-tensorrt==1.3.0a0
[pip3] torchaudio==2.0.2
[pip3] torchtext==0.13.0a0+fae8e8c
[pip3] torchvision==0.15.0a0
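For reference, an environment summary in this format can be regenerated at any time with PyTorch's built-in helper:

# prints OS, CUDA, driver, and relevant pip/conda package versions
python3 -m torch.utils.collect_env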

Does it look like a race/sync issue?
Hopefully I will be able to post results from NVIDIA's compute-sanitizer.
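For anyone trying the same thing, a typical invocation would look roughly like this (a sketch; the server launch command is a placeholder, and memcheck is only one of the available tools):

# wrap the workload in compute-sanitizer; racecheck/synccheck/initcheck are the other tools
compute-sanitizer --tool memcheck --log-file sanitizer.log <sherpa server launch command>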

gabor-pinter commented 1 year ago

Answering myself: no exhaustive testing has been done yet, but when the CUDA runtime and PyTorch's CUDA versions match (both 11.7, by using nvcr.io/nvidia/pytorch:22.08-py3 as the base image), the problem seems to disappear.
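A minimal sketch of that setup, assuming Docker with the NVIDIA container toolkit is available:

# the 22.08 NGC image ships PyTorch built against CUDA 11.7, so runtime and build versions agree
docker pull nvcr.io/nvidia/pytorch:22.08-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.08-py3 bash
# ... then build/install k2 and sherpa inside this container ...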

danpovey commented 1 year ago

If it happens frequently enough, it may be possible to find out which kernel was running when it crashed by doing something like nsys profile python3 ... and then looking at the resulting .qdrep file with NVIDIA Nsight Systems. That could require a "debug" version of PyTorch, though.
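A concrete invocation might look like this (a sketch; the output name and trace selection are just reasonable defaults, and the launch script is a placeholder):

# profile the run under Nsight Systems; open the resulting .qdrep/.nsys-rep file in the GUI
nsys profile --trace=cuda,nvtx,osrt -o sherpa_freeze python3 <your_launch_script.py>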

csukuangfj commented 1 year ago

The node I am using has 8 virtual CPUs; Sherpa uses 11 threads and seems to be active on 4 of them:

By the way, could you post the complete commands you are using? Also, did you change any code?

gabor-pinter commented 1 year ago

Hi Dan, thanks for the comment on the profiler. Though I have only used it on the "fixed" setup so far, the nsys output is really informative, thanks for mentioning it. For the reader: a human-readable report can be generated with nsys stats report3.nsys-rep (where report3.nsys-rep is the nsys dump):
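For example (a sketch; report names differ slightly across nsys versions):

# overall summaries from the captured profile
nsys stats report3.nsys-rep
# or a single report, e.g. the per-kernel GPU time summary
nsys stats --report cuda_gpu_kern_sum report3.nsys-rep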

That could require a "debug" version of PyTorch, though.

Do you mean a static build of torch with debug symbols? I am slowly developing an itch to build torch in-house, and there will probably be a point where we cannot avoid it.

gabor-pinter commented 1 year ago

Hi Csukuangfj,

Also, did you change any code?

Yes, we made some changes, but mainly around logging.

could you post the complete commands you are using?

Sure, let me go back to a version where I can reproduce the issue, and I will post the command (hopefully with some insights from nsys).

danpovey commented 1 year ago

Torch has some kind of debug build option, I think; I don't know whether they distribute such builds via pip, etc.
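For reference, a source build with debug information is driven by environment variables; a rough sketch (dependency installation elided, and the exact flags may vary by PyTorch version):

# build PyTorch from source with debug information
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# DEBUG=1 produces a full debug build; REL_WITH_DEB_INFO=1 keeps optimizations but adds symbols
REL_WITH_DEB_INFO=1 python3 setup.py develop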

gabor-pinter commented 1 year ago

When it comes to debug builds, I believe there are too many flags/options to consider for a release version.

gabor-pinter commented 1 year ago

An update: I tested the crash-y version in 3 conditions:

[1] running the binary directly

[2] nsys run

[3] compute-sanitizer run

@csukuangfj, here is the command to start the server:

/workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
     --port=7014 \
     --nn-model=${MDL_DIR}/cpu_jit.pt \
     --tokens=${MDL_DIR}/tokens.txt \
     --doc-root=$WEB_INDEX \
     --use-gpu=true \
     --sample-frequency=8000 \
     --num-work-threads=10 \
     --max-batch-size=400 \
     --decode-chunk-size=64
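For reference, the "stuck at 100%" state can be watched from another shell while the server is handling streams; a minimal sketch:

# a wedged GPU shows sustained 100% utilization with no progress in the server logs
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv -l 1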

csukuangfj commented 1 year ago

RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.7.
  Make sure that libnvrtc-builtins.so.11.7 is installed correctly.

The error shows that it cannot find the following file:

/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7

Could you set

export LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH

and see if this error goes away.
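A quick way to confirm the loader can now resolve it (run in the same shell after the export):

# dlopen the library by soname; it should load without raising OSError
python3 -c "import ctypes; ctypes.CDLL('libnvrtc-builtins.so.11.7'); print('found')"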

gabor-pinter commented 1 year ago

Hi @csukuangfj, thanks for the hint. The modification of LD_LIBRARY_PATH worked. However, after 2 runs the server crashed. Since a few of us are using the same server, (1) I am not absolutely sure whether this modification has anything to do with the crash, and (2) I will have to find a calm period when I can try again.

One thing I noticed, though, is that the python/dist-packages path preceded CUDA's compat library. My guess is that the compat lib is supposed to come left-most in LD_LIBRARY_PATH.
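A sketch of that ordering, assuming the compat libraries live under /usr/local/cuda/compat as in many NGC images:

# keep the driver compat libs ahead of the pip-installed nvrtc path
export LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH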

danpovey commented 1 year ago

If the server just rebooted without anything in the logs, it's likely that the power supply was not sufficient and it tripped due to the GPUs being used too much.
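One way to check for that kind of event after the fact (a sketch; exact log messages vary by driver version):

# kernel log entries from the NVIDIA driver (Xid errors, resets) around the crash time
sudo dmesg -T | grep -iE "xid|nvrm"
# and whether the GPUs were running close to their power limit
nvidia-smi --query-gpu=power.draw,power.limit,temperature.gpu --format=csv -l 1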