Open asesorov opened 2 weeks ago
Hello, does this same configuration work for you outside of the context of optimum-benchmark?
Also, how did you launch your benchmarks? You mentioned mpirun, but I'm not sure that's needed to run distributed trt-llm.
@IlyasMoutawwakil when running without mpi, I get RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)
Here's the sample configuration I use:
defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_
name: trt_llama
launcher:
  device_isolation: true
  device_isolation_action: warn
backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  tp: 2
  pp: 1
  world_size: 2
  gpus_per_node: 2
  model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100
And here is the command line that successfully launches the benchmark: mpirun -n 2 --allow-run-as-root optimum-benchmark --config-dir /mnt/host --config-name trt_llama_2gpus
However, without mpi I'm getting the mpiSize == tp * pp assertion error. Please tell me if I'm doing something wrong. Thank you in advance.
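The assertion in the error message compares the number of MPI ranks with tp * pp. A minimal sketch of that condition, assuming two hypothetical helpers (`expected_world_size` and `check_world_size` are illustrative names, not part of optimum-benchmark or TensorRT-LLM), could be used to sanity-check a config before launching:

```python
def expected_world_size(tp: int, pp: int) -> int:
    """Number of ranks a tp x pp layout needs (hypothetical helper)."""
    if tp < 1 or pp < 1:
        raise ValueError("tp and pp must both be >= 1")
    return tp * pp


def check_world_size(mpi_size: int, tp: int, pp: int) -> None:
    """Mirror the `mpiSize == tp * pp` condition quoted in the error message."""
    needed = expected_world_size(tp, pp)
    if mpi_size != needed:
        raise RuntimeError(
            f"mpiSize ({mpi_size}) != tp * pp ({tp} * {pp} = {needed})"
        )


# With the config above (tp=2, pp=1), two ranks satisfy the condition.
check_world_size(mpi_size=2, tp=2, pp=1)
```

With `mpi_size=1` and the same tp/pp values, the check raises, which matches the "MPI size: 1" situation described in this thread.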
Will investigate this. I remember launching distributed (tp) trt-llm without mpirun, but it's been a while.
I was able to run trt-llm with tp and pp without the mpirun runner; I believe that's only needed for multi-node. Both configs are being tested as part of the CI with TinyLlama.
Very strange - I tried now to reproduce CLI tests on my machine using optimum-nvidia:latest container, and still got the same error:
test-cli:logging_utils.py:63 RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)
In logs, I see that world_size is indeed 1:
[PYTEST-PROCESS][2024-09-24 07:23:01][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0
Here are my steps:
Sorry, I double-checked the logs and figured out that I'm using pre-built engines from single-GPU runs 🤦♂️
Nevertheless, I still see this line after a successful run: [TensorRT-LLM][INFO] MPI size: 1, rank: 0
And in nvidia-smi I see that only 1 of 2 GPUs is used during CLI tests.
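One way to make the "only 1 of 2 GPUs is used" observation reproducible is to sample nvidia-smi in CSV mode while the benchmark runs. The sketch below is only an illustration: the function names and the 5% idle threshold are assumptions, and it presumes every queried field is numeric (some GPUs report `[N/A]`, which this parser does not handle).

```python
import subprocess

# Real nvidia-smi flags; prints one line per GPU, e.g. "0, 97, 21034".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text: str) -> dict:
    """Parse 'index, util%, memory MiB' lines into a dict keyed by GPU index."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        usage[int(index)] = {"util_percent": int(util), "memory_mib": int(mem)}
    return usage


def idle_gpus(usage: dict, util_threshold: int = 5) -> list:
    """Return the indices of GPUs below the (assumed) utilization threshold."""
    return [i for i, s in usage.items() if s["util_percent"] < util_threshold]


def query_gpu_usage() -> dict:
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_usage(out.stdout)


# Made-up sample text, for illustration only (not taken from this issue).
sample = "0, 97, 21034\n1, 0, 3\n"
print(idle_gpus(parse_gpu_usage(sample)))
```

Calling `query_gpu_usage()` a few times during generation would show whether the second GPU ever leaves the idle list.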
Also, I see this in the GitHub CI log (e.g. https://github.com/huggingface/optimum-benchmark/actions/runs/11008321942/job/30565746560):
[PYTEST-PROCESS][2024-09-24 06:41:41][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0
[PYTEST-PROCESS][2024-09-24 06:41:42][test-cli][INFO] - [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
In my "local" tests (on an A100) I see equal usage on both GPUs until the KV cache starts being allocated; that's when one GPU is used more than the other (it almost gets saturated). I guess that's weird, but it sounds like an issue in tensorrt-llm. Locally, I also don't get the warning [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
That warning comes from the communication topology, as explained in https://github.com/NVIDIA/TensorRT-LLM/issues/1487#issuecomment-2074214678 (I'm running "locally" on a DGX machine with SXM4, so it makes sense that p2p is supported there).
I also checked the optimum-nvidia code, and it uses the LLM helper class (see https://github.com/huggingface/optimum-nvidia/blob/main/src/optimum/nvidia/runtime.py). This API uses MPIPoolSession when mpirun is not used to launch (see https://github.com/NVIDIA/TensorRT-LLM/blob/a65dba7aaf7e2d8bb0120eea8f8f04deff145d6a/tensorrt_llm/hlapi/llm.py#L126-L132). The class is better documented in the examples: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llm-api
Tell me if this makes sense. I admit it is weird and confusing that the logs show the MPI size as 1.
Tried on another machine with different GPUs, and still see the same usage (one GPU is used - and, as you said, almost saturated, another is idle):
Additionally, the metrics with a single GPU and with TP on 2 GPUs are identical (with a 4090 and TinyLlama, throughput is always around 350 tokens/s). Indeed, it seems like a trt-llm issue. Can you tell me whether it is possible to smoothly upgrade TensorRT-LLM from 0.9.0dev (used in the optimum-nvidia image) to a newer version to try it? Also, when I used mpirun I (expectedly) saw the throughput results twice, with slightly different values. Is it correct to sum these results to get the correct throughput? And thank you for your help.
No, it's actually wrong to sum throughputs with TP or PP. These two strategies split the model and not the data: in the case of TP, tensors are split and only half of the computation is performed on each GPU, but you can't have different inputs on each process (unlike DP). That's why batch_size=1 works with TP and PP, but the min batch size with DP is 2.
It makes sense to me that TP gives as much perf as a single GPU here. In fact, I'm surprised it reaches that, as it's a strategy that's optimized for compute-bound problems (big weights + prefill = big matmuls) and comes with a bit of comm overhead.
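To illustrate the point about not summing: under TP or PP every rank reports on the same request stream, whereas under DP each rank serves different inputs. A small sketch, assuming a hypothetical `aggregate_throughput` helper (averaging the TP/PP ranks is this sketch's own choice for smoothing measurement noise; the thread only states that summing is wrong):

```python
def aggregate_throughput(per_rank_tokens_per_s, strategy):
    """Combine per-rank throughput readings according to the parallelism type."""
    if not per_rank_tokens_per_s:
        raise ValueError("need at least one per-rank measurement")
    if strategy == "dp":
        # Data parallel: each rank processes different inputs, so rates add up.
        return sum(per_rank_tokens_per_s)
    if strategy in ("tp", "pp"):
        # Tensor/pipeline parallel: all ranks cooperate on the same inputs,
        # so each reading describes the same work; average instead of summing.
        return sum(per_rank_tokens_per_s) / len(per_rank_tokens_per_s)
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative numbers only, not measurements from this issue.
readings = [348.0, 352.0]
print(aggregate_throughput(readings, "tp"))  # mean of the two readings
print(aggregate_throughput(readings, "dp"))  # sum of the two readings
```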
@asesorov I can also easily implement an MPIrun launcher to verify these results. Will ping you in a PR.
Problem Description
When trying to use pipeline parallelism in tensorrt-llm on 2+ NVIDIA GPUs, I encounter AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}. I tried other models, but the error is the same.
Environment
Optimum Benchmark configuration
Logs
With mpirun: trt-llm_2gpus_pp_mpirun_n2.log
Without mpirun: trt-llm_2gpus_pp.log
Preview of the error: