vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: "Using Tesla V100 to load the GPTQ-Int4 model results in all output being exclamation marks." #9618

Open hpx502766238 opened 1 month ago

hpx502766238 commented 1 month ago

Your current environment

The output of `python collect_env.py`

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       GenuineIntel
Model name:                      Intel Xeon Processor (Skylake, IBRS)
CPU family:                      6
Model:                           85
Thread(s) per core:              2
Core(s) per socket:              1
Socket(s):                       8
Stepping:                        4
BogoMIPS:                        4589.20
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB (16 instances)
L1i cache:                       512 KiB (16 instances)
L2 cache:                        32 MiB (8 instances)
L3 cache:                        128 MiB (8 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] No relevant packages
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-15            0               N/A
GPU1    PHB      X      0-15            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

No response

🐛 Describe the bug
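
The reproduction script below uses `client` and `model_name` without defining them. A minimal setup, assuming an OpenAI-compatible vLLM server running locally (the base URL, API key, and model selection are placeholders, not taken from the report), might look like this:

```python
import json
import datetime

from openai import OpenAI

# Placeholder connection details for a local vLLM OpenAI-compatible server
# (assumed values, not from the original report).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_name = client.models.list().data[0].id  # or set the served model name directly
```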

messages = [{"role": "system", "content": "You are an helpful assistant."}]

def chat_with_model(user_input):
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=0.7,
        top_p=0.8,
        # max_tokens=1024,
        extra_body={
            "repetition_penalty": 1.05
        },
        stream=True,
    )

    # Collect the streamed reply
    full_reply = ""
    for chunk in response:
        content = chunk.choices[0].delta.content or ""
        print(content, end="", flush=True)
        full_reply += content

    print()
    messages.append({"role": "assistant", "content": full_reply})

def save_session():
    """保存当前会话到JSON文件"""
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    filename = f"message_{timestamp}.json"
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(messages, f, ensure_ascii=False, indent=4)
    print(f"会话已保存到 {filename}")

def main():
    while True:
        user_input = []
        print("User:", end="")

        # Handle multi-line input (lines continue until one ends with a backslash)
        line = input()
        while not line.endswith("\\"):
            user_input.append(line)
            line = input()
        user_input.append(line[:-1])  # strip the trailing backslash

        # Assemble the complete input
        complete_input = "\n".join(user_input)

        if complete_input.lower() == ">exit":
            print("退出程序")
            break
        elif complete_input.lower() == ">save":
            save_session()
        else:
            chat_with_model(complete_input)

if __name__ == "__main__":
    print(model_name)
    main()

Using two Tesla V100 GPUs to load the Qwen2.5-32B-GPTQ-Int4 model results in all output being exclamation marks. However, loading the 14B-GPTQ-Int8 model or the non-quantized Qwen2.5-7B-Instruct model works normally, suggesting that vLLM may have a compatibility issue with GPTQ-Int4 quantization on this hardware.
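
To narrow down whether the problem lies in the GPTQ-Int4 kernel path rather than in the OpenAI-compatible server or the client script, a minimal offline run can help. The sketch below is only an assumption of how the model might be loaded (the Hugging Face model ID, `tensor_parallel_size=2`, and the explicit `quantization="gptq"` / `dtype="float16"` settings are illustrative, not taken from the report). The V100 is compute capability 7.0, so float16 is required, and the newer Marlin GPTQ kernels (which need compute capability 8.0+) are not available on this GPU, leaving the plain GPTQ kernel in use.

```python
# Minimal offline reproduction sketch (model path and settings are assumptions,
# not taken from the original report).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",  # hypothetical model ID/path
    quantization="gptq",       # force the plain GPTQ kernel explicitly
    dtype="float16",           # V100 (sm_70) has no bfloat16 support
    tensor_parallel_size=2,    # two V100-PCIE-32GB cards
)

params = SamplingParams(temperature=0.7, top_p=0.8,
                        repetition_penalty=1.05, max_tokens=128)

outputs = llm.generate(["Briefly introduce yourself."], params)
print(outputs[0].outputs[0].text)
```

If this offline run also produces only exclamation marks, the problem is in the quantized inference path itself rather than in the serving layer or the client code.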


errfgod commented 3 days ago

Same issue, but only the last one.