I'm running Triton Inference Server on a machine with 4 GPUs (no pipeline parallelism). Following the GPT guide, I can run inference with tensor parallelism = 2 (so only using 2 of the GPUs). However, if I follow the same steps but run with tensor parallelism = 4, any single inference I send freezes, similar to https://github.com/triton-inference-server/fastertransformer_backend/issues/19, except that it happens even when `NCCL_LAUNCH_MODE` is set to `GROUP` or `PARALLEL`.
The GPUs also show full utilization (according to `nvidia-smi`) until I kill the container, potentially hours later and long after the timeout window. The server doesn't report any requests in flight for any model during that time, though.
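For reference, this is roughly the request I'm sending (a minimal sketch: the model name `fastertransformer` and the input/output tensor names follow the GPT example in the repo, so adjust them if your config differs):

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Token IDs are arbitrary here; names and dtypes follow the GPT example config.
input_ids = np.array([[9915, 27221, 59, 77, 383, 1853]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[32]], dtype=np.uint32)

inputs = []
for name, data in [
    ("input_ids", input_ids),
    ("input_lengths", input_lengths),
    ("request_output_len", request_output_len),
]:
    inp = httpclient.InferInput(name, list(data.shape), "UINT32")
    inp.set_data_from_numpy(data)
    inputs.append(inp)

# With tensor_para_size=2 this returns almost immediately;
# with tensor_para_size=4 it never returns.
result = client.infer("fastertransformer", inputs)
print(result.as_numpy("output_ids"))
```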
logs:
```
W0706 20:19:47.394019 877 libfastertransformer.cc:999] before ThreadForward 0
W0706 20:19:47.394158 877 libfastertransformer.cc:1006] after ThreadForward 0
W0706 20:19:47.394177 877 libfastertransformer.cc:999] before ThreadForward 1
W0706 20:19:47.394287 877 libfastertransformer.cc:1006] after ThreadForward 1
W0706 20:19:47.394303 877 libfastertransformer.cc:999] before ThreadForward 2
I0706 20:19:47.394317 877 libfastertransformer.cc:834] Start to forward
I0706 20:19:47.394388 877 libfastertransformer.cc:834] Start to forward
W0706 20:19:47.394424 877 libfastertransformer.cc:1006] after ThreadForward 2
W0706 20:19:47.394444 877 libfastertransformer.cc:999] before ThreadForward 3
I0706 20:19:47.394530 877 libfastertransformer.cc:834] Start to forward
W0706 20:19:47.394565 877 libfastertransformer.cc:1006] after ThreadForward 3
I0706 20:19:47.394651 877 libfastertransformer.cc:834] Start to forward
```
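To back up the "no requests in flight" observation: while the GPUs sit at 100%, the server still looks healthy and idle from the outside. This is roughly how I'm checking (assuming the default HTTP port 8000 and metrics port 8002, and the same model name as in the sketch above):

```python
import requests
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The server still reports itself live and ready while the GPUs spin.
print("live:", client.is_server_live(), "ready:", client.is_server_ready())

# Per-model statistics show nothing queued or executing.
print(client.get_inference_statistics(model_name="fastertransformer"))

# The Prometheus metrics endpoint (default port 8002) tells the same story.
for line in requests.get("http://localhost:8002/metrics").text.splitlines():
    if line.startswith("nv_inference"):
        print(line)
```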