vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Triton assertion errors serving Llama-3.1-8b on 4xH100s in FP32 precision #8579

Closed pgimenes closed 1 month ago

pgimenes commented 1 month ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8469C
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 48
Socket(s):                          2
Stepping:                           6
Frequency boost:                    enabled
CPU max MHz:                        2601.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5200.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          4.5 MiB (96 instances)
L1i cache:                          3 MiB (96 instances)
L2 cache:                           192 MiB (96 instances)
L3 cache:                           195 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-47,96-143
NUMA node1 CPU(s):                  48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence;
[conda] nvidia-nccl-cu12            2.20.5    pypi_0  pypi
[conda] nvidia-nvjitlink-cu12       12.6.68   pypi_0  pypi
[conda] nvidia-nvtx-cu12            12.1.105  pypi_0  pypi
[conda] pyzmq                       26.2.0    pypi_0  pypi
[conda] torch                       2.4.0     pypi_0  pypi
[conda] torchvision                 0.19.0    pypi_0  pypi
[conda] transformers                4.44.2    pypi_0  pypi
[conda] triton                      3.0.0     pypi_0  pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9ba0817ff1eb514f51cc6de9cb8e16c98d6ee44f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV6   NV6   NV6   PIX   NODE  NODE  SYS   SYS   SYS   0-47,96-143     0              N/A
GPU1  NV6   X     NV6   NV6   NODE  PIX   PIX   SYS   SYS   SYS   0-47,96-143     0              N/A
GPU2  NV6   NV6   X     NV6   SYS   SYS   SYS   PIX   PIX   NODE  48-95,144-191   1              N/A
GPU3  NV6   NV6   NV6   X     SYS   SYS   SYS   NODE  NODE  PIX   48-95,144-191   1              N/A
NIC0  PIX   NODE  SYS   SYS   X     NODE  NODE  SYS   SYS   SYS
NIC1  NODE  PIX   SYS   SYS   NODE  X     PIX   SYS   SYS   SYS
NIC2  NODE  PIX   SYS   SYS   NODE  PIX   X     SYS   SYS   SYS
NIC3  SYS   SYS   PIX   NODE  SYS   SYS   SYS   X     PIX   NODE
NIC4  SYS   SYS   PIX   NODE  SYS   SYS   SYS   PIX   X     NODE
NIC5  SYS   SYS   NODE  PIX   SYS   SYS   SYS   NODE  NODE  X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
```

Model Input Dumps

No response

🐛 Describe the bug

I'm trying to serve Llama-3.1-8b with tensor parallelism across 4 H100 GPUs in FP32 precision, and I get the following Triton assertion error.

python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

This only seems to happen for FP32, not for FP16, BF16, or FP8. It can be reproduced with the following commands and script.

Server side:

vllm serve \
    meta-llama/Meta-Llama-3.1-8B-Instruct \
    --api_key test \
    --tensor_parallel_size 4 \
    --dtype float32

Client side:

import asyncio
import aiohttp

HOST_URL = "http://localhost:8000"
API_KEY = "test"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

openai_request_headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

timeout = aiohttp.ClientTimeout(
    total=60*60*24,
    connect=60*60*24,
    sock_read=60*60*24,
    sock_connect=60*60*24,
    ceil_threshold=60*60*24,
)

prompts = [
    "Hey, this is a prompt"
]

async def send_request(session, index, input_prompt):
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": input_prompt,
            }
        ],
        "temperature": 0.5,
        "stream": True,
        "max_tokens": 1000,
        "logprobs": True,
    }

    url = HOST_URL + "/v1/chat/completions"
    headers = openai_request_headers

    async with session.post(url, json=payload, headers=headers, timeout=timeout) as response:
        print(response)

async def main():
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = []
        for index, prompt in enumerate(prompts):
            task = send_request(session, index, prompt)
            tasks.append(task)

        # Wait for all requests to finish
        await asyncio.gather(*tasks)

# Run the main function
asyncio.run(main())


pgimenes commented 1 month ago

Somehow the issue seems to be resolved by setting --enable_chunked_prefill false
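For reference, this is what the original serve command from the report looks like with the workaround applied (a sketch only; the flag spelling follows this comment and has not been re-verified here):

```shell
# Same invocation as in the bug description, with chunked prefill
# explicitly disabled to avoid the Triton append-attention path.
vllm serve \
    meta-llama/Meta-Llama-3.1-8B-Instruct \
    --api_key test \
    --tensor_parallel_size 4 \
    --dtype float32 \
    --enable_chunked_prefill false
```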

robertgshaw2-neuralmagic commented 1 month ago

We fall back to using XFORMERS as the attention backend when the weights are FP32. With the XFORMERS backend, we use a custom Triton attention kernel for chunked prefill (which needs "append"-style attention). There seems to be a bug in that kernel for fp32.

In general, I would not advise using the fp32 dtype. Using the dtype that the model was trained in is preferred.
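As a quick way to follow this advice: Hugging Face checkpoints record the dtype they were saved in under the `torch_dtype` key of `config.json`, and vLLM's `--dtype auto` (the default) uses it. A minimal sketch for reading it (the inline JSON below is a hypothetical excerpt, not the real config file):

```python
import json

def native_dtype(config_json: str) -> str:
    """Return the dtype a checkpoint was saved in, per its config.json."""
    config = json.loads(config_json)
    # Hugging Face model configs store the checkpoint dtype as "torch_dtype".
    return config.get("torch_dtype", "float32")

# Hypothetical excerpt of a Llama-style config.json for illustration.
example = '{"model_type": "llama", "torch_dtype": "bfloat16"}'
print(native_dtype(example))  # bfloat16
```

So for a bfloat16 checkpoint like this, serving with `--dtype bfloat16` (or just leaving `--dtype auto`) stays on the well-tested attention path.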

pgimenes commented 1 month ago

Thanks for the clarification!