vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Bug] [ROCm]: ROCm fails to stop generating tokens on multiple GPTQ models #7011

Open TNT3530 opened 1 month ago

TNT3530 commented 1 month ago

Your current environment

My Environment

OpenAI API launched using this command:

VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_NCCL_SO_PATH=/opt/rocm/lib/librccl.so.1 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.7 --tensor-parallel-size 4 --model <model> --enforce-eager --swap-space 0 --port 5000 --quantization gptq --max-model-len 32768

--dtype half was used for Llama models
--chat-template <template> was used for Command-R models

Docker launched using this command:

sudo docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri -v <mounted folder>:/vllm-workspace/model vllm:<version> bash

🐛 Describe the bug

When loading these models (Command R Plus, Llama 3.1 70B, Llama 3.1 70B Alternate) using a Docker image built from source as of 2024-07-24, every prompt continues to generate until (I assume) it hits the token limit. It does this regardless of passed sampling parameters like temperature. This has been happening with Command R+ since its release, with versions ranging from 0.4.1 to 0.5.3.post1.
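For reference, a request along the lines of the sketch below (the port matches the launch command above; the prompt and parameter values are illustrative, not the originals) exercises the behaviour: the completion only ends when max_tokens is exhausted, so the finish_reason would be expected to come back as "length" rather than "stop" while the bug is present.

# Minimal sketch of a reproduction request against the OpenAI-compatible server
# launched above (port 5000); the prompt and parameter values are illustrative.
import requests

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "model": "<model>",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.0,   # changing sampling parameters does not change the behaviour
        "max_tokens": 256,    # generation only ends when this cap is reached
    },
    timeout=600,
)
choice = resp.json()["choices"][0]
print(choice["finish_reason"])             # "length" while the bug is present, "stop" otherwise
print(choice["message"]["content"])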

Normal Command-R works using this model. Llama 3.1 Instruct 8B straight from Meta works as well.

Here is a sample of what Command R Plus generates: [screenshot attached]. I can't give a similar sample for Llama, since the script used to generate the above throws an unrelated error about freeze_support() and all API calls just time out with no response.
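(As an aside, the freeze_support() error usually means the client script is being re-imported under the "spawn" multiprocessing start method; the sketch below is the standard guard for that, assuming the script has a single entry point, and is unrelated to the stop-token bug itself.)

# Minimal sketch of the usual fix for the freeze_support() RuntimeError seen with
# the "spawn" multiprocessing start method; unrelated to the stop-token bug.
import multiprocessing

def main():
    # build prompts and call the API here
    ...

if __name__ == "__main__":
    multiprocessing.freeze_support()  # harmless no-op outside frozen executables
    main()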

I tried force-updating Transformers via pip in the container, but it did not fix the issue.

This is on a 4x AMD Instinct MI100 system with a GPU bridge.

tutu329 commented 1 month ago

Though it cannot be stopped, I want to know what the inference speed is with 4x MI100. Thanks a lot.

TNT3530 commented 1 month ago

Though it cannot be stopped, I want to know what the inference speed is with 4x MI100. Thanks a lot.

https://new.reddit.com/user/TNT3530/comments/1akazn8/amd_instinct_mi100_benchmarks_across_multiple_llm/

TNT3530 commented 1 month ago

Just tested 0.4.1, 0.5.2, and 0.5.3.post1 with Mistral Large Instruct 2407 GPTQ and the same thing happens. It's interesting how three different model architectures all have the same issue when quantized.

TNT3530 commented 1 month ago

After reviewing the output of Llama 3.1 70B and Mistral Large via streaming and lowering the max response length, it seems generation continues due to the lack of a stop token. Here is the nonsense being generated: [screenshot attached]
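A quick way to check whether the quantized checkpoint even declares a usable stop token (a minimal sketch; "<model>" is the same placeholder path as elsewhere in this thread) is to compare the tokenizer's EOS token with the ids in the model config and generation config:

# Minimal sketch: compare the EOS / stop-token ids declared in the checkpoint.
# "<model>" is a placeholder for the local GPTQ model path.
from transformers import AutoConfig, AutoTokenizer, GenerationConfig

model_path = "<model>"

tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)

print("tokenizer eos:", tokenizer.eos_token, tokenizer.eos_token_id)
print("config eos_token_id:", config.eos_token_id)

try:
    gen_config = GenerationConfig.from_pretrained(model_path)
    print("generation_config eos_token_id:", gen_config.eos_token_id)
except OSError:
    print("no generation_config.json shipped with this checkpoint")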

TNT3530 commented 1 month ago

I tried loading these models with auto_gptq using the following script:

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_path = "<model>"

# Load the tokenizer and the GPTQ-quantized weights (Triton kernels, ExLlama disabled)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    use_safetensors=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    disable_exllama=True,
    use_fast=True,
    use_triton=True
)

# Simple chat-style prompt, rendered through the model's own chat template
prompt = [
    { "role": "system", "content": "You are a helpful assistant that responds to user inquiries." },
    { "role": "user", "content": "What is the main benefit of AI Assistants?" }
]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to("cuda")

# Generate with sampling, capped at 128 new tokens
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Both Llama 3.1 and Mistral Large had good results in multiple trial runs. Cohere sadly isn't supported in auto_gptq, so Command R+ couldn't be tested.
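For completeness, the same comparison could be run through vLLM's offline API (a minimal sketch; "<model>" is a placeholder and the settings mirror the server launch command above) to see whether the missing stop token reproduces outside the OpenAI server:

# Minimal sketch of the equivalent check through vLLM's offline API; "<model>" is
# a placeholder and the settings mirror the server launch command above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<model>",
    quantization="gptq",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.7,
    enforce_eager=True,
)

# For a fair comparison the chat template should be applied to the prompt first,
# as in the auto_gptq script above; a raw prompt is used here only for brevity.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["What is the main benefit of AI Assistants?"], params)

for out in outputs:
    # finish_reason is "stop" when an EOS token is emitted, "length" when the cap is hit
    print(out.outputs[0].finish_reason, repr(out.outputs[0].text))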