triton-inference-server / model_analyzer

Triton Model Analyzer is a CLI tool that helps you better understand the compute and memory requirements of Triton Inference Server models.
Apache License 2.0

How to analyze large models like Llama 3 70B that require model parallelism? #907

Open ccchow opened 3 weeks ago

ccchow commented 3 weeks ago

The model engine is built from Llama 3 70B with tensor parallelism (tp=2) and pipeline parallelism (pp=2), so world_size = tp × pp = 4. It is deployed with the Triton launch script below:

```
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb
```

In this case, how can I leverage model-analyzer to analyze this parallelized model/deployment?

nv-braf commented 1 week ago

Are you able to run this model on PA or GenAI-Perf?

ccchow commented 1 week ago

I was able to use perf_analyzer to instrument Llama 3 70B on 4x A100s (trtllm backend) against a launched Triton server, as shown below:

```
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb/
perf_analyzer -m ensemble --measurement-interval 10000 --concurrency-range <start:end:step> --input-data input.json
```
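For comparison, since GenAI-Perf was mentioned above, a roughly equivalent GenAI-Perf run against the same endpoint might look like the sketch below. The `profile` subcommand and flag values are assumptions based on recent GenAI-Perf releases, not commands from this thread; check `genai-perf --help` for your version.

```
# Sketch only: GenAI-Perf against the same running Triton server.
# Subcommand and flag values are assumed, not taken from this thread.
genai-perf profile \
  -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --concurrency 4 \
  --measurement-interval 10000
```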

I'm wondering how I can tune the Triton model config using Model Analyzer in this case.

Thanks.
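One possibility worth noting here (my reading of the Model Analyzer docs, not a confirmed answer from this thread): `triton_launch_mode: remote` tells Model Analyzer to profile a Triton server that is already running, so the multi-GPU server started by `launch_triton_server.py` stays up and untouched. A minimal config sketch, where the endpoints and the concurrency sweep are placeholder assumptions:

```yaml
# config.yaml -- sketch for remote-mode profiling.
# Endpoints, repository path, and the concurrency range are assumptions.
model_repository: llama_ifb
triton_launch_mode: remote          # profile an already-running server
triton_http_endpoint: localhost:8000
triton_grpc_endpoint: localhost:8001
profile_models:
  ensemble:
    parameters:
      concurrency:
        start: 1
        stop: 16
```

Run it with `model-analyzer profile -f config.yaml`. Since remote mode never starts or restarts Triton itself, server-level launch options such as `--world_size` stay fixed; only parameters that can be swept against the live server get explored.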

LanceB57 commented 1 week ago

I'm in a very similar predicament, but with 8x H100s. I'm getting pretty underwhelming results and would also like to know how to utilize model-analyzer, as I'm fairly new to Triton.