**Open** · ccchow opened this issue 3 weeks ago
Are you able to run this model on PA or GenAI-Perf?
I was able to use perf_analyzer to instrument Llama 3 70B on 4×A100 (TensorRT-LLM backend) against a Triton server launched as below:
```
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb/
perf_analyzer -m ensemble --measurement-interval 10000 --concurrency-range <start:end:step> --input-data input.json
```
I'm wondering how I can tune the Triton model config using Model Analyzer in this case.
Thanks.
I'm in a very similar predicament, but with 8×H100. I'm getting pretty underwhelming results and would also like to know how to use model-analyzer, as I'm fairly new to Triton.
The model engine is built from Llama 3 70B with tensor parallelism tp=2 and pipeline parallelism pp=2, and deployed with the Triton launch script below:

```
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb
```
In this case, how can model-analyzer be pointed at this parallelized model/deployment for analysis?
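Not an authoritative answer, but one approach that may work for both setups here is Model Analyzer's *remote* launch mode, where it profiles a Triton server you have already started (e.g. the MPI-launched server from `launch_triton_server.py`) rather than trying to launch one itself. A minimal config sketch — the repository path, endpoint, and concurrency range below are assumptions for illustration, not tested values:

```yaml
# config.yml — sketch for model-analyzer remote mode (values are placeholders)
model_repository: /path/to/llama_ifb
triton_launch_mode: remote            # don't launch Triton; attach to the already-running server
triton_grpc_endpoint: localhost:8001  # default Triton gRPC port; adjust to your deployment
profile_models:
  ensemble:
    perf_analyzer_flags:
      input-data: input.json          # same dataset passed to perf_analyzer above
      measurement-interval: 10000
    parameters:
      concurrency:
        start: 1
        stop: 64
        step: 8
```

Then run something like `model-analyzer profile -f config.yml`. One caveat, as I understand it: in remote mode Model Analyzer can only sweep client-side load parameters (concurrency, request rate); server-side model-config changes for a multi-GPU TRT-LLM deployment would still require relaunching the server yourself between runs.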