vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
27.68k stars 4.08k forks

Incorrect completions with tensor parallel size of 8 on MI300X GPUs #2817

Closed: seungduk-yanolja closed this issue 3 weeks ago

seungduk-yanolja commented 7 months ago

I'm encountering an issue where vLLM fails to generate complete or sensible responses when the tensor parallel size is set to 8 on MI300X GPUs. Completions work as expected with tensor parallel sizes of 1 and 4.
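For reference, a launch command along these lines should reproduce the setup; the model name and port are placeholders, not taken from the report:

```shell
# Hypothetical reproduction: serve a model across all 8 MI300X GPUs.
# The failing case is --tensor-parallel-size 8; sizes 1 and 4 reportedly work.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 8 \
    --port 8000
```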

Expected behavior:

vLLM should generate a correct and meaningful completion for the given prompt, similar to its behavior with tensor parallel sizes of 1 and 4.

Actual behavior:

vLLM provides an incomplete or nonsensical response, often similar to the following:

    {
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": " <"
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": 96,
            "total_tokens": 99,
            "completion_tokens": 3
        }
    }

System information:

apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.0.60000-91~20.04
Status: install ok installed
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.0.0.60000-91~20.04), hipblaslt (= 0.6.0.60000-91~20.04), hipfft (= 1.0.12.60000-91~20.04), hipsolver (= 2.0.0.60000-91~20.04), hipsparse (= 3.0.0.60000-91~20.04), hiptensor (= 1.1.0.60000-91~20.04), miopen-hip (= 3.00.0.60000-91~20.04), half (= 1.12.0.60000-91~20.04), rccl (= 2.18.3.60000-91~20.04), rocalution (= 3.0.3.60000-91~20.04), rocblas (= 4.0.0.60000-91~20.04), rocfft (= 1.0.23.60000-91~20.04), rocrand (= 2.10.17.60000-91~20.04), hiprand (= 2.10.16.60000-91~20.04), rocsolver (= 3.24.0.60000-91~20.04), rocsparse (= 3.0.2.60000-91~20.04), rocm-core (= 6.0.0.60000-91~20.04), composablekernel-dev (= 1.1.0.60000-91~20.04), hipblas-dev (= 2.0.0.60000-91~20.04), hipblaslt-dev (= 0.6.0.60000-91~20.04), hipcub-dev (= 3.0.0.60000-91~20.04), hipfft-dev (= 1.0.12.60000-91~20.04), hipsolver-dev (= 2.0.0.60000-91~20.04), hipsparse-dev (= 3.0.0.60000-91~20.04), hiptensor-dev (= 1.1.0.60000-91~20.04), miopen-hip-dev (= 3.00.0.60000-91~20.04), rccl-dev (= 2.18.3.60000-91~20.04), rocalution-dev (= 3.0.3.60000-91~20.04), rocblas-dev (= 4.0.0.60000-91~20.04), rocfft-dev (= 1.0.23.60000-91~20.04), rocprim-dev (= 3.0.0.60000-91~20.04), rocrand-dev (= 2.10.17.60000-91~20.04), hiprand-dev (= 2.10.16.60000-91~20.04), rocsolver-dev (= 3.24.0.60000-91~20.04), rocsparse-dev (= 3.0.2.60000-91~20.04), rocthrust-dev (= 3.0.0.60000-91~20.04), rocwmma-dev (= 1.3.0.60000-91~20.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: Radeon Open Compute (ROCm) Runtime software stack

hliuca commented 7 months ago

Could you try building RCCL from a newer (or the latest) release and dynamically linking it via LD_LIBRARY_PATH?
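The suggestion above could be carried out roughly as follows; the install prefix is illustrative, and the exact build flags may differ between RCCL releases:

```shell
# Sketch: build a newer RCCL from source and put it ahead of the
# ROCm 6.0 system copy on the dynamic linker search path.
git clone https://github.com/ROCm/rccl.git
cd rccl
mkdir build && cd build
# hipcc as the compiler is typical for RCCL; adjust to your ROCm install.
CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_INSTALL_PREFIX=/opt/rccl-latest ..
make -j"$(nproc)" && make install
# Make vLLM pick up the freshly built library at runtime.
export LD_LIBRARY_PATH=/opt/rccl-latest/lib:$LD_LIBRARY_PATH
```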

hongxiayang commented 2 months ago

@seungduk-yanolja With the older version (ROCm 6.0) you might need --enforce-eager for the multi-GPU case. On the current main branch (ROCm 6.1.x with other patches), this should work with the default graph mode. Please test again and update the issue if this still happens for you.
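The ROCm 6.0 workaround mentioned above would look something like this (the model name is a placeholder):

```shell
# Disable HIP graph capture with --enforce-eager when combining
# tensor parallelism with ROCm 6.0, per the comment above.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 8 \
    --enforce-eager
```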

hongxiayang commented 3 weeks ago

Closing this issue. If you see any new issues with the current main branch, please open a new one.