[Open] mindhash opened 6 months ago
Reproduced on the NGC 24.02 and 24.03 containers when trying to load more than one engine instance on 8 A100 GPUs, by specifying:

```
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
```

in `tensorrt_llm/config.pbtxt`.
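For reference, the world size a TRT-LLM engine requires is the product of its tensor- and pipeline-parallel degrees, and the `--world_size` passed at launch has to match it. A minimal shell sketch (pp_size of 1 is an assumption here, since only tp_size is mentioned above):

```shell
# World size required by the engine (pp_size assumed 1; adjust if the
# engine was built with pipeline parallelism as well).
TP_SIZE=8
PP_SIZE=1
WORLD_SIZE=$((TP_SIZE * PP_SIZE))
echo "launch with --world_size ${WORLD_SIZE}"
```

A mismatch between this value and the `--world_size` given to the launch script is a common cause of ranks waiting on each other indefinitely.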
TRT-LLM v0.8.0; engine built with tp_size 8.
System Info
Environment:
- 2× NVIDIA A100 with NVLink
- TensorRT-LLM Backend version v0.8.0
- LLaMA 2 engine built with `paged_kv_cache` and tp_size 2, world size 2
- x86_64 arch
Who can help?
No response
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Steps to reproduce:

```
huggingface-cli login --token <token>
python scripts/launch_triton_server.py --world_size 2 --model_repo=<engine_dir>
```
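While waiting on the server, it can help to poll Triton's standard KServe v2 health endpoint rather than watching the log alone; a hung startup never reports ready. A minimal sketch, assuming the default HTTP port 8000 on localhost (adjust if your launch overrides it):

```shell
# Probe Triton's readiness endpoint (default HTTP port 8000 assumed).
HTTP_PORT=8000
READY_URL="http://localhost:${HTTP_PORT}/v2/health/ready"
echo "probing ${READY_URL}"
# Returns HTTP 200 only once all models have finished loading:
curl --fail --silent --max-time 5 "${READY_URL}" \
  && echo "server ready" \
  || echo "server not ready"
```

If this never turns ready while GPU memory usage has stopped changing, the load is likely deadlocked rather than slow.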
The server hangs forever. I have waited up to 30 minutes without any response.
Expected behavior
The models should be loaded and the server should reach a ready state (displaying the HTTP endpoints), OR errors should be shown in the log if there are any.
actual behavior
The server hangs forever. I have waited up to 30 minutes without any response.
additional notes
I am looking for further direction to debug this. The log shows neither any activity in progress nor an error message.
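Since the log is silent, one way to see where loading is blocked is to dump native stack traces of the hung ranks (multi-GPU hangs often turn out to be an MPI collective waiting for a rank that never started). A sketch, assuming `gdb` is available inside the container:

```shell
# Dump stack traces of any running tritonserver ranks to locate the hang
# (gdb availability in the container is an assumption).
PIDS=$(pgrep -f tritonserver || true)
if [ -z "${PIDS}" ]; then
  echo "no tritonserver process found"
else
  for PID in ${PIDS}; do
    echo "=== backtrace for pid ${PID} ==="
    gdb --batch -p "${PID}" -ex "thread apply all bt"
  done
fi
```

Separately, `tritonserver` itself accepts `--log-verbose=1`, which surfaces per-model load progress, if your version of the launch script forwards extra server arguments.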
Log: