vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Limited Request Handling for AMD Instinct MI300X GPUs with Tensor Parallelism > 1 #2988

Open · Spurthi-Bhat-ScalersAI opened this issue 4 months ago

Spurthi-Bhat-ScalersAI commented 4 months ago

Reproduction steps:

  1. Clone the vllm repo and switch to tag v0.3.1

  2. Build the Dockerfile.rocm dockerfile following the instructions under "Option 3: Build from source with docker" in the Installation with ROCm documentation.

    build command:

    docker build  -f Dockerfile.rocm -t vllm-rocm .
  3. The vLLM serving command used:

    python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --dtype float16 --tensor-parallel-size 8
  4. Load-tested with Apache Bench at 256 concurrent requests (a minimal sketch of the command is shown below).
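A minimal sketch of such a load test, assuming the server's default port 8000 and its OpenAI-compatible /v1/completions route; payload.json (a JSON completion request body with model, prompt, and max_tokens) and the total request count are illustrative assumptions, not taken from the issue:

    # Illustrative only: -n (total requests) and payload.json are assumptions; -c matches the 256 concurrent requests
    ab -n 1024 -c 256 -p payload.json -T application/json http://localhost:8000/v1/completions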

The resulting error:

INFO 02-21 10:31:34 metrics.py:161] Avg prompt throughput: 352.5 tokens/s, Avg generation throughput: 55.2 tokens/s, Running: 67 reqs, Swapped: 0 reqs, Pending: 130 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
Memory access fault by GPU node-2 (Agent handle: 0x9f73a80) on address 0x7ef9eb704000. Reason: Unknown.
*** SIGABRT received at time=1708511498 on cpu 37 ***
PC: @     0x7f0f63c8400b  (unknown)  raise
    @     0x7f0f63fa1420       4224  (unknown)
    @     0x7f0e76ca147c  (unknown)  (unknown)
[2024-02-21 10:31:38,596 E 725390 741603] logging.cc:361: *** SIGABRT received at time=1708511498 on cpu 37 ***
[2024-02-21 10:31:38,596 E 725390 741603] logging.cc:361: PC: @     0x7f0f63c8400b  (unknown)  raise
[2024-02-21 10:31:38,596 E 725390 741603] logging.cc:361:     @     0x7f0f63fa1420       4224  (unknown)
[2024-02-21 10:31:38,596 E 725390 741603] logging.cc:361:     @     0x7f0e76ca147c  (unknown)  (unknown)
Fatal Python error: Aborted

Aborted (core dumped)

Issues:

  1. The above error occurs whenever the tensor parallel size is set to more than 1 for the Llama 2 70B model.
  2. The maximum number of concurrent requests the vLLM server can handle before the error occurs is just 5.
Spurthi-Bhat-ScalersAI commented 4 months ago

I have found a similar issue posted: #2942

Spurthi-Bhat-ScalersAI commented 4 months ago

I have found a workaround for the issue: enabling the --enforce-eager flag.
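For reference, a sketch of the serving command from the reproduction steps with the workaround applied (same model, dtype, and tensor parallel size assumed):

    # --enforce-eager keeps the model in eager-mode PyTorch instead of capturing CUDA/HIP graphs
    python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --dtype float16 --tensor-parallel-size 8 --enforce-eager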

tom-papatheodore commented 4 months ago

Good to know about --enforce-eager as a workaround, @Spurthi-Bhat-ScalersAI. Someone on the Discord server also pointed out that the vLLM docs state that only Mistral and Mixtral models are supported, yet you seem to be running Llama 2 like I was attempting. I'm not sure if Llama 2 just happens to work or if the docs are simply out of date. Regardless, thanks for following up on this with a solution!

Spurthi-Bhat-ScalersAI commented 4 months ago

Is the --enforce-eager flag recommended? I faced the same issue while running Falcon-180B as well. Are there any other solutions?

hongxiayang commented 3 months ago

Yes, currently --enforce-eager is recommended. We are working on enabling HIP graph mode to improve performance, but for now, please use the --enforce-eager flag. Thanks. This is also documented. [screenshot of the documentation]

Spurthi-Bhat-ScalersAI commented 3 months ago

Thanks for your response, @hongxiayang!