triton-inference-server / fastertransformer_backend

Error if Triton Binary is started early #16

Closed jimwu6 closed 2 years ago

jimwu6 commented 2 years ago

I've run into a situation where I get the following error:

...
W0401 23:09:32.708642 1 libfastertransformer.cc:648] input: OUTPUT1, type: TYPE_FP32, shape: [63, 63]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain. /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:314

[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** Process received signal ***
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal: Aborted (6)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal code:  (-6)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff1854a83c0]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff184c3018b]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff184c0f859]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff184fe6911]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff184ff238c]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff184ff23f7]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff184ff26a9]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 7] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x27d99)[0x7ff120d9dd99]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 8] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x119c5)[0x7ff120d879c5]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 9] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x11e66)[0x7ff120d87e66]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [10] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2740a)[0x7ff120d9d40a]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [11] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff18501ede4]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff18549c609]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff184d0c293]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** End of error message ***
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:1    :0:32] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
Before Loading Model:
==== backtrace (tid:     32) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7ff022b43824]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x2b9ff) [0x7ff022b439ff]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x2bd34) [0x7ff022b43d34]
 3  /lib/x86_64-linux-gnu/libc.so.6(abort+0x213) [0x7ff184c0f941]
 4  /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7ff184fe6911]
 5  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7ff184ff238c]
 6  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7ff184ff23f7]
 7  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7ff184ff26a9]
 8  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x27d99) [0x7ff120d9dd99]
 9  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x119c5) [0x7ff120d879c5]
10  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x11e66) [0x7ff120d87e66]
11  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2740a) [0x7ff120d9d40a]
12  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4) [0x7ff18501ede4]
13  /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff18549c609]
14  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff184d0c293]
=================================
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** Process received signal ***
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal: Segmentation fault (11)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal code:  (-6)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Failing at address: 0x1
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff1854a83c0]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(abort+0x213)[0x7ff184c0f941]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 2] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff184fe6911]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff184ff238c]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff184ff23f7]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff184ff26a9]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 6] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x27d99)[0x7ff120d9dd99]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 7] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x119c5)[0x7ff120d879c5]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 8] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x11e66)[0x7ff120d87e66]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 9] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2740a)[0x7ff120d9d40a]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [10] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff18501ede4]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [11] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff18549c609]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [12] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff184d0c293]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** End of error message ***

To reproduce this, I first built the image as described in the README on the dev/v1.1_beta branch, with Triton version 21.08. Then I start the container as follows:

docker run -it --rm -d --gpus=4 -p8000:8000 -p8001:8001 -p8002:8002 \
    -v ${TRITON_MODELS_STORE}:/model-store:ro \
    -v ${WORKSPACE}:/ft_workspace \
    --name ft \
    ${TRITON_DEV_DOCKER_IMAGE} \
    /bin/bash

Then I enter the container with docker exec -it ft /bin/bash.

Finally, I run the binary to start the server:

mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model-store

If I start the server very soon after the container starts (probably within about 5 s), I get this error. As a local workaround, I run the server binary again, and the second attempt works.

However, if I wait longer after the container starts (10 s seems to work reliably) before running the binary, the error does not appear.
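The workaround described above (wait, or simply run the binary again) can be sketched as a small retry wrapper. This is only an illustration, not something from the repository: the function name retry_cmd, the attempt count, and the delay are assumptions, and it masks the symptom rather than fixing the underlying cause. It also assumes the launch command exits with a non-zero status when it aborts.

```shell
#!/bin/sh
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits with status 0,
# sleeping DELAY_SECONDS between attempts, up to MAX_ATTEMPTS times.
retry_cmd() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example use inside the container (values are illustrative, not tuned):
# retry_cmd 3 10 mpirun -n 1 --allow-run-as-root \
#     /opt/tritonserver/bin/tritonserver --model-repository=/model-store
```

Because a healthy server keeps running until it is stopped, the wrapper only relaunches when the process exits with an error, as it does when the abort shown in the log occurs.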

Searching online suggests this is some mismatch of NVIDIA drivers (e.g. CUDA), but I find that odd because the drivers should be contained in the Docker image anyway.

I can reproduce this on A100s and T4s as well.

byshiue commented 2 years ago

The error is raised here: https://github.com/NVIDIA/FasterTransformer/blob/dev/v5.0_beta/src/fastertransformer/utils/cuda_utils.h#L314. That is, it comes directly from a CUDA function. I think the problem is, as you say, a driver mismatch. Although there is a driver inside the Docker image, you still need to install a matching driver on the host, outside the container.
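One way to check for the mismatch suggested above is to compare the driver version reported by nvidia-smi against the minimum required by the CUDA version in the image. The sketch below is an assumption-laden illustration: version_ge and check_driver are hypothetical helper names, the minimum version must be looked up in NVIDIA's release notes for your image, and the comparison relies on GNU sort -V.

```shell
#!/bin/sh
# version_ge A B: succeeds if dotted version A is >= B.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_driver MIN_VERSION: hypothetical helper that reads the driver
# version from nvidia-smi and compares it with MIN_VERSION.
check_driver() {
    min_version=$1
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    if [ -z "$driver" ]; then
        echo "could not query the driver version" >&2
        return 2
    fi
    if version_ge "$driver" "$min_version"; then
        echo "driver $driver satisfies minimum $min_version"
    else
        echo "driver $driver is older than minimum $min_version" >&2
        return 1
    fi
}

# Example (the version is a placeholder, not taken from this issue):
# check_driver 470.57.02
```

Running check_driver both on the host and inside the container should report the same driver version, since the container uses the host's kernel driver.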

byshiue commented 2 years ago

Closing this issue because it is inactive. Feel free to re-open it if you still have any problems.