huggingface / optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.

TensorRT-LLM pipeline parallelism is broken #259

Open asesorov opened 2 weeks ago

asesorov commented 2 weeks ago

Problem Description

When trying to use pipeline parallelism with the tensorrt-llm backend on 2+ NVIDIA GPUs, I encounter AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}. I tried other models, but the error is the same.

Environment

Optimum Benchmark configuration

defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_

name: trt_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  max_new_tokens: 100
  tp: 1
  pp: 2
  world_size: 2
  gpus_per_node: 2
  model: IlyaGusev/saiga_llama3_8b

scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100

Logs

With mpirun: trt-llm_2gpus_pp_mpirun_n2.log

Without mpirun: trt-llm_2gpus_pp.log

Preview of the error:

AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}

IlyasMoutawwakil commented 1 week ago

Hello, does this same configuration work for you outside of the context of optimum-benchmark? Also, how did you launch your benchmarks? You mentioned mpirun, but I'm not sure that's needed to run distributed trt-llm.

asesorov commented 1 week ago

@IlyasMoutawwakil When running without MPI, I get RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)

Here's the sample configuration I use:

defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_

name: trt_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  tp: 2
  pp: 1
  world_size: 2
  gpus_per_node: 2
  model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100

And the command line that successfully launches the benchmark: mpirun -n 2 --allow-run-as-root optimum-benchmark --config-dir /mnt/host --config-name trt_llama_2gpus. However, without MPI I'm getting the mpiSize == tp * pp assertion error. Please tell me if I'm doing something wrong. Thank you in advance.
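
If I understand the assertion correctly, the runtime simply requires the number of MPI ranks to equal tp * pp, which is what worldConfig.cpp checks, so running a world_size=2 engine from a single process (MPI size 1) fails. A minimal Python-side sketch of the same constraint, assuming the tensorrt_llm package exposes Mapping (argument names may differ between versions):

from tensorrt_llm import Mapping

# With tp=2, pp=1 the runtime expects world_size == tp * pp == 2 MPI ranks;
# launching from a single process (MPI size 1) is what trips the assertion.
mapping = Mapping(world_size=2, tp_size=2, pp_size=1, gpus_per_node=2)
print(mapping.world_size, mapping.tp_size, mapping.pp_size)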

IlyasMoutawwakil commented 1 week ago

Will investigate this. I remember launching distributed (tp) trt-llm without mpirun, but it's been a while now.

IlyasMoutawwakil commented 1 week ago

I was able to run trt-llm with tp and pp without the mpirun runner; I believe that's only needed for multi-node setups. Both configs are tested as part of the CI with TinyLlama.

asesorov commented 1 week ago

Very strange. I just tried to reproduce the CLI tests on my machine using the optimum-nvidia:latest container and still got the same error: test-cli:logging_utils.py:63 RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)

In the logs, I see that the world size is indeed 1: [PYTEST-PROCESS][2024-09-24 07:23:01][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0

Here are my steps:

  1. docker run -it --rm --gpus all -e HF_HOME=/mnt/storage -e CUDA_VISIBLE_DEVICES=0,1 --name optimum-nvidia docker.io/huggingface/optimum-nvidia:latest
  2. pip install optimum-benchmark[tensorrt-llm]
  3. git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark/
  4. pip install -e .[testing]
  5. FORCE_SEQUENTIAL=1 pytest tests/test_cli.py -x -s -k "cli and cuda and tensorrt_llm and (tp or pp)"

asesorov commented 1 week ago

Sorry, I double-checked the logs and figured out that I was using pre-built engines from single-GPU runs 🤦‍♂️. Nevertheless, I still see this line after a successful run: [TensorRT-LLM][INFO] MPI size: 1, rank: 0. And in nvidia-smi I see that only 1 of the 2 GPUs is used during the CLI tests.

asesorov commented 1 week ago

Also, I see this in the GitHub CI log (e.g. https://github.com/huggingface/optimum-benchmark/actions/runs/11008321942/job/30565746560):

[PYTEST-PROCESS][2024-09-24 06:41:41][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0
[PYTEST-PROCESS][2024-09-24 06:41:42][test-cli][INFO] - [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.

IlyasMoutawwakil commented 1 week ago

In my "local" tests (on an A100) I see equal usage on both GPUs, until kv cache starts being allocated and that's when one machine uses more than the other (almost gets saturated) I guess that's weird but it sounds like an issue in tensorrt-llm. I also don't get [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available. locally, this is an issue with the communication topology as explained in https://github.com/NVIDIA/TensorRT-LLM/issues/1487#issuecomment-2074214678, I'm running "locally" on a DGX machine with SXM4 so it makes sense to support p2p.

I also checked the optimum-nvidia code: it uses the LLM helper class at https://github.com/huggingface/optimum-nvidia/blob/main/src/optimum/nvidia/runtime.py. That API uses an MPIPoolSession when mpirun is not used to launch: https://github.com/NVIDIA/TensorRT-LLM/blob/a65dba7aaf7e2d8bb0120eea8f8f04deff145d6a/tensorrt_llm/hlapi/llm.py#L126-L132. The class is better documented in the examples: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llm-api
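
For reference, this is roughly how the llm-api examples drive multi-GPU inference without mpirun. This is only a sketch based on the examples linked above; the exact class and argument names (tensor_parallel_size / pipeline_parallel_size) may differ in the 0.9.0dev build shipped in the optimum-nvidia image:

from tensorrt_llm import LLM

# The helper class sets up the MPI session itself when mpirun is not used,
# so a plain `python script.py` is enough on a single node.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,  # or pipeline_parallel_size=2 for PP
)

for output in llm.generate(["What is pipeline parallelism?"]):
    print(output.outputs[0].text)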

IlyasMoutawwakil commented 1 week ago

Tell me if this makes sense; I admit it is weird and confusing that the logs show MPI size as 1.

asesorov commented 1 week ago

I tried on another machine with different GPUs and still see the same usage: one GPU is used (and, as you said, almost saturated) while the other is idle: [nvidia-smi screenshot]

Additionally, the metrics when using a single GPU or TP on 2 GPUs are identical (with a 4090 and TinyLlama, throughput is always around 350 tokens/s). It does indeed seem like a trt-llm issue. Can you tell me if it is possible to smoothly upgrade TensorRT-LLM from 0.9.0dev (used in the optimum-nvidia image) to a newer version to try it? Also, when I used mpirun I (expectedly) saw two throughput results which were a bit different: is it correct to sum these to get the overall throughput? Thank you for your help.

IlyasMoutawwakil commented 1 week ago

No, it's actually wrong to sum throughputs with TP or PP. These two strategies split the model, not the data: in the case of TP the tensors are split, so only half of the computation is performed on each GPU, but you can't have different inputs on each process (unlike DP). That's why batch_size=1 works with TP and PP, while the minimum batch size with DP is 2.

It makes sense to me that TP gives about as much performance as a single GPU here; in fact, I'm surprised it reaches that, since it's a strategy optimized for compute-bound problems (big weights + prefill = big matmuls) at the cost of some communication overhead.
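
To make the aggregation point concrete, here is a purely illustrative back-of-the-envelope calculation (numbers made up):

# Illustrative only: how per-rank throughput should be aggregated.
per_rank_tok_s = 350.0
num_ranks = 2

# DP: each rank serves *different* requests, so system throughput adds up.
dp_total = per_rank_tok_s * num_ranks   # ~700 tok/s

# TP/PP: both ranks cooperate on the *same* requests, so each rank reports the
# same generated tokens; summing would double-count them.
tp_pp_total = per_rank_tok_s            # ~350 tok/s

print(dp_total, tp_pp_total)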

IlyasMoutawwakil commented 1 week ago

@asesorov I can also easily implement an MPIrun launcher to verify these results. Will ping you in a PR.