vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Correctness issue with `Llama-7b` and batch size 3 #1464

Closed: amogkam closed this issue 8 months ago

amogkam commented 1 year ago

"meta-llama/Llama-2-7b-hf", is returning different output vs. original HF model with a batch size of 3.

This is running on a single A10G with tensor_parallel_size=1.

With a batch size of 1, the output is the same.

from vllm.transformers_utils.tokenizer import get_tokenizer
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-2-7b-hf"
prompts = ["Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'"] * 1  # batch size of 1
max_tokens = 128

tokenizer = get_tokenizer(model_name, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half, trust_remote_code=True).cuda()

hf_outputs = []
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = hf_model.generate(input_ids.cuda(), use_cache=True, max_new_tokens=max_tokens, do_sample=False)
    output_str = tokenizer.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    hf_outputs.append(output_str[0])
del hf_model
torch.cuda.empty_cache()  # free the HF model's GPU memory before loading the vLLM engine

vllm_engine = LLM(model=model_name, tokenizer=model_name, trust_remote_code=True, dtype="half", swap_space=0)
sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
req_outputs = vllm_engine.generate(prompts, sampling_params)
vllm_outputs = []
for req_output in req_outputs:
    vllm_outputs.append(req_output.prompt + req_output.outputs[0].text)
del vllm_engine

for hf_output, vllm_output in zip(hf_outputs, vllm_outputs):
    assert hf_output == vllm_output # This passes

But if I use a batch size of 3 with the same prompt, the outputs do not match for all of the prompts in the batch:

from vllm.transformers_utils.tokenizer import get_tokenizer
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-2-7b-hf"
prompts = ["Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'"] * 3  # batch size of 3
max_tokens = 128

tokenizer = get_tokenizer(model_name, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half, trust_remote_code=True).cuda()

hf_outputs = []
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = hf_model.generate(input_ids.cuda(), use_cache=True, max_new_tokens=max_tokens, do_sample=False)
    output_str = tokenizer.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    hf_outputs.append(output_str[0])
del hf_model
torch.cuda.empty_cache()  # free the HF model's GPU memory before loading the vLLM engine

vllm_engine = LLM(model=model_name, tokenizer=model_name, trust_remote_code=True, dtype="half", swap_space=0)
sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
req_outputs = vllm_engine.generate(prompts, sampling_params)
vllm_outputs = []
for req_output in req_outputs:
    vllm_outputs.append(req_output.prompt + req_output.outputs[0].text)
del vllm_engine

for hf_output, vllm_output in zip(hf_outputs, vllm_outputs):
    assert hf_output == vllm_output # This fails
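
As a debugging aid, here is a minimal sketch (not part of the original report; it reuses hf_outputs and vllm_outputs from the script above) that prints where each pair of outputs first diverges instead of relying on a bare assert:

# Small debugging aid: report the first character position where the two
# generations differ, plus a short excerpt of each, for every prompt.
for i, (hf_output, vllm_output) in enumerate(zip(hf_outputs, vllm_outputs)):
    if hf_output == vllm_output:
        print(f"prompt {i}: outputs match")
        continue
    # find the first character position where the two strings differ
    pos = next(
        (j for j, (a, b) in enumerate(zip(hf_output, vllm_output)) if a != b),
        min(len(hf_output), len(vllm_output)),
    )
    print(f"prompt {i}: diverges at character {pos}:")
    print("  HF  :", hf_output[pos:pos + 80])
    print("  vLLM:", vllm_output[pos:pos + 80])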
amogkam commented 1 year ago

The hidden states for the first output token that differs are significantly different between the HF and vLLM models.
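
One way to probe how sensitive the greedy argmax is to batching is the sketch below. It only exercises the HF model (it does not touch vLLM's kernels), and it simply checks whether running the same prompt alone vs. replicated three times already shifts the fp16 last-token logits or flips the argmax:

# Sketch: compare the HF model's last-token logits for a single prompt vs. the
# same prompt replicated three times, to see how much batched fp16 matmuls
# alone perturb the logits and whether the greedy argmax changes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half).cuda()

prompt = "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    single_logits = model(input_ids).logits[0, -1]                 # batch size 1
    batched_logits = model(input_ids.repeat(3, 1)).logits[0, -1]   # batch size 3

print("max abs logit diff:", (single_logits - batched_logits).abs().max().item())
print("greedy argmax matches:", single_logits.argmax().item() == batched_logits.argmax().item())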

simon-mo commented 1 year ago

Which version/commit is this?

amogkam commented 1 year ago

I can reproduce this with v0.2.1.post1.
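
For anyone trying to reproduce, a minimal check of the installed version:

import vllm
print(vllm.__version__)  # prints 0.2.1.post1 in this environment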

ankitshah009 commented 1 year ago

Same here. Does anyone know the root cause of this issue?

hmellor commented 8 months ago

Closing this issue as stale, since there has been no discussion in the past 3 months.

If you are still experiencing the issue you describe, feel free to re-open this issue.