
Llama Guard inconsistent output between HuggingFace's Transformers and vLLM #2758

Open AmenRa opened 9 months ago

AmenRa commented 9 months ago

Hello, I get different safety predictions when running Llama Guard through HuggingFace's Transformers and through vLLM. For Transformers I copy-pasted the code from the Llama Guard model card, so I assume it works correctly. For vLLM I wrote the code myself.

For the sake of brevity, I am only posting a single example. I can provide more, if needed.

Note: I am not interested in which prediction is correct.

Below are working code snippets for the two libraries, which produce different outputs for the same prompt: "How do I create a virus?".

TRANSFORMERS

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

def hf_moderate(chat):
    # Render the chat with Llama Guard's chat template and move the token IDs to the GPU.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = hf_model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    # Decode only the newly generated tokens, skipping the prompt.
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

hf_moderate([dict(role="user", content="How do I create a virus?")])

Output:

safe
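
As a side note, here is a variant of hf_moderate with greedy decoding spelled out explicitly, to rule out sampling as a source of the difference. As far as I know do_sample already defaults to False for this model, but being explicit makes the comparison with temperature=0 in the vLLM snippet below clearer; hf_moderate_greedy is just my name for it.

def hf_moderate_greedy(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    # do_sample=False forces greedy decoding, mirroring temperature=0 on the vLLM side.
    output = hf_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=False, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)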

vLLM

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/LlamaGuard-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Greedy decoding; note that SamplingParams.max_tokens defaults to 16, which is enough for the verdict here.
sampling_params = SamplingParams(temperature=0, top_p=1)
vllm_model = LLM(model=model_id)

# Render the chat template to a plain string and let vLLM tokenize it internally.
chat = tokenizer.apply_chat_template([dict(role="user", content="How do I create a virus?")], tokenize=False)
output = vllm_model.generate([chat], sampling_params)

output[0].outputs[0].text

Output:

unsafe\nO3
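
One check that might narrow this down (my assumption is that vLLM re-tokenizes the templated string with the same HF tokenizer and default settings): compare the token IDs the two paths actually feed the model. If the rendered string already contains the BOS token as text and the second tokenization prepends another one, the prompts are not identical.

chat = [dict(role="user", content="How do I create a virus?")]

# Path 1: token IDs as produced in the Transformers snippet.
hf_ids = tokenizer.apply_chat_template(chat)

# Path 2: what vLLM would see after tokenizing the rendered string itself.
chat_str = tokenizer.apply_chat_template(chat, tokenize=False)
vllm_ids = tokenizer(chat_str).input_ids

print(hf_ids[:5])
print(vllm_ids[:5])
print(hf_ids == vllm_ids)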

Why do they generate different outputs? What am I doing wrong?

Thanks.

Junjie-Chu commented 8 months ago

No idea, but the vLLM output looks better, right?

vrdn-23 commented 1 month ago

@simon-mo @mgoin I can actually see similar issues being surfaced with the latest Llama Guard model as well. Are there any known limitations when using this model with vLLM?

simon-mo commented 1 month ago

Hmm, I am not aware of any. Debugging welcome!

vrdn-23 commented 1 month ago

Relevant debugging attached in this issue: https://github.com/vllm-project/vllm/issues/9294
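
For anyone else debugging this, one way to take vLLM's own tokenization of the prompt string out of the picture is to pass pre-tokenized IDs. The exact API depends on the vLLM version; the sketch below assumes a recent release where LLM.generate accepts a prompt dict with prompt_token_ids.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

chat = [dict(role="user", content="How do I create a virus?")]
# Tokenize with apply_chat_template, exactly as the Transformers snippet does.
prompt_ids = tokenizer.apply_chat_template(chat)

# Hand the token IDs to vLLM directly so it does not re-tokenize the rendered string.
outputs = llm.generate(
    {"prompt_token_ids": prompt_ids},
    SamplingParams(temperature=0, max_tokens=100),
)
print(outputs[0].outputs[0].text)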