triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

3rd Tritonserver fails to respond #509

Open njaramish opened 1 week ago

njaramish commented 1 week ago

### System Info

8xH100 node, deploying each server inside its own Docker container

### Who can help?

No response

### Information

### Tasks

### Reproduction

Using TensorRT-LLM v0.10.0. The image is built from the tensorrtllm_backend repo using Docker.
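For completeness, the build was roughly along these lines; the release tag, image name, and Dockerfile path below are assumptions based on the repo's documented build flow rather than the exact commands used:

    # Sketch of the image build (assumed; not the exact commands used).
    # Clone the backend repo at the matching release and build the Dockerfile it ships.
    git clone -b v0.10.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
    cd tensorrtllm_backend
    git submodule update --init --recursive
    # Build the Triton image with the TensorRT-LLM backend baked in.
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
        -f dockerfile/Dockerfile.trt_llm_backend .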

1. Compile engines
    
    python convert_checkpoint.py --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
                                --output_dir {checkpoint_dir} \
                                --dtype float16 \
                                --tp_size 2

    trtllm-build --checkpoint_dir {checkpoint_dir} \
                 --output_dir {deploy_dir} \
                 --gemm_plugin float16 \
                 --workers 2 \
                 --tp_size 2 \
                 --pp_size 1 \
                 --gpt_attention_plugin float16 \
                 --context_fmha enable \
                 --remove_input_padding enable \
                 --use_custom_all_reduce disable \
                 --paged_kv_cache enable \
                 --use_paged_context_fmha disable \
                 --max_input_len 32768 \
                 --max_batch_size 10 \
                 --max_output_len 1024 \
                 --max_beam_width 1 \
                 --max_num_tokens 65544

2. Launch servers, changing the ports and GPUs they use, similar to the commands below, but running each command in a separate Docker container that is started with access to a different set of 2 GPUs (see the sketch after step 3 for one way this can be done):

    python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8000 --grpc_port 8001 --metrics_port 8002
    python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8010 --grpc_port 8011 --metrics_port 8012
    python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8020 --grpc_port 8021 --metrics_port 8022

3. `curl localhost:8020/v2/health/ready`
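For step 2, the per-container GPU and port isolation was along the lines of the sketch below (an approximation, not the exact commands; the image tag `triton_trt_llm`, container name, mounted paths, and GPU indices are placeholders):

    # Hypothetical launch of the 3rd server: the container only sees GPUs 4-5 and
    # publishes that server's HTTP/gRPC/metrics ports. Image tag, paths, and GPU
    # indices are assumptions.
    docker run -d --name trtllm_server_3 --shm-size=2g \
        --gpus '"device=4,5"' \
        -p 8020:8020 -p 8021:8021 -p 8022:8022 \
        -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
        -v {deploy_dir}:{deploy_dir} \
        triton_trt_llm sleep infinity
    docker exec trtllm_server_3 \
        python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
        --world_size 2 --model_repo={deploy_dir} \
        --http_port 8020 --grpc_port 8021 --metrics_port 8022

With all three containers up, readiness can be polled per server, e.g. `for p in 8000 8010 8020; do curl -s -o /dev/null -w "%{http_code}\n" localhost:$p/v2/health/ready; done`.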

### Expected behavior

Given the tutorial [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#running-multiple-instances-of-llama-model-on-multiple-gpus), I expected to be able to run several Tritonservers with `tensorrtllm_backend` on a single node, each responsive to requests, provided that every Tritonserver has its own allocation of GPUs and ports.

### Actual behavior

Launching one container running one Tritonserver works as expected, as does launching a second container running another Tritonserver (on different ports and GPUs). However, launching a 3rd container with a 3rd Tritonserver results in the 3rd Tritonserver not responding:

    curl localhost:8020/v2/health/ready
    curl: (56) Recv failure: Connection reset by peer



Sometimes, the request goes through and a healthy response is returned. Similarly, I am sometimes able to get a response from the ensemble model, but other times I get the connection reset error. 

The two Tritonservers that are spun up first always work. Starting with no containers or servers running, if I spin up the 1st and 3rd port-gpu configuration first, those two Tritonservers work, while the 2nd port-gpu configuration does not. 
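When the unresponsive server is in this state, a quick check along these lines (assuming `ss` and `ps` are available inside the container) can distinguish "nothing is listening on the port" from "the server is up but resetting connections":

    # Run inside the affected container: is anything listening on that server's
    # ports, and are the tritonserver MPI ranks still alive?
    ss -ltn | grep -E '8020|8021|8022'
    ps aux | grep -i tritonserver | grep -v grep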

### Additional notes

I see that the example [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#running-multiple-instances-of-llama-model-on-multiple-gpus) only launches two Tritonservers. Does `tensorrtllm_backend` support launching more than two Tritonservers on the same node?