vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Inference with LLaMA 65B generates nothing but \n #450

Open foamliu opened 1 year ago

foamliu commented 1 year ago

The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well. Below is the code to reproduce it:

from vllm import LLM, SamplingParams

args_model = '/mnt/sdb/ly/models/hf_converted_llama/65B/'
llm = LLM(model=args_model, tokenizer=args_model, tokenizer_mode='slow', dtype='float16', seed=42, tensor_parallel_size=8)
sampling_params = SamplingParams(temperature=0, max_tokens=10)  # temperature=0 -> greedy decoding
prompt = 'The capital of France is'
outputs = llm.generate(prompts=[prompt], sampling_params=sampling_params)
>>> outputs
[RequestOutput(request_id=0, prompt='The capital of France is', prompt_token_ids=[0, 450, 7483, 310, 3444, 338], outputs=[CompletionOutput(index=0, text='\n\n\n\n\n\n\n\n\n\n', token_ids=[13, 13, 13, 13, 13, 13, 13, 13, 13, 13], cumulative_logprob=-34.18291640281677, logprobs={}, finish_reason=length)], finished=True)]

And HuggingFace transformers works normally:

import transformers

model_path = "/mnt/sdb/ly/models/hf_converted_llama/65B/"
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_path)
model = transformers.LlamaForCausalLM.from_pretrained(model_path, device_map="auto")
prompt = 'The capital of France is'
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=10)  # greedy decoding by default
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
>>> text
'The capital of France is Paris.\nThe capital of France is Paris.'
Hukongtao commented 1 year ago

I had the same problem. The outputs from vLLM and HF are inconsistent.

lucasjinreal commented 1 year ago

Does this happen only on the 65B model? I am using 7B normally.

HermitSun commented 1 year ago

Is there any difference between the generation args of vLLM and HF? It seems that vLLM has some args that HF does not have.

Hukongtao commented 1 year ago

Is there any difference between the generation args of vLLM and HF? It seems that vLLM has some args that HF does not have.

For reference, the generation args on the HF side: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161

foamliu commented 1 year ago

Does this happen only on the 65B model? I am using 7B normally.

The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well.

foamliu commented 1 year ago

Is there any difference between the generation args of vLLM and HF? It seems that vLLM has some args that HF does not have.

For reference, the generation args on the HF side: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161

The generation args are in the code above; both should use greedy decoding, so the results shouldn't differ much, yet vLLM outputs nothing but '\n'.

foamliu commented 1 year ago

I had the same problem. The outputs from vLLM and HF are inconsistent.

Yes, I found this problem too. With greedy decoding, although LLaMA 7B/13B/30B produce meaningful output, the results differ from HF transformers.

For example, the following are the scores of my evaluation with several benchmarks:

GSM8K
         LLaMA 7B   LLaMA 13B   LLaMA 30B
vLLM     9.40       15.01       24.94
HF       10.46      14.86       30.40

MMLU
         LLaMA 7B   LLaMA 13B   LLaMA 30B
vLLM     35.8       46.9        48.9
HF       34.1       46.7        57.8
lucasjinreal commented 1 year ago

The generation params can heavily affect the final model performance.

MM-IR commented 1 year ago

So is it reliable to evaluate LLaMA results using your scripts? That is really weird...

foamliu commented 1 year ago

So is it reliable to evaluate LLaMA results using your scripts? That is really weird...

The same result can be reproduced consistently on my V100 server.

andyfeih commented 1 year ago

vLLM may just have failed to load the weights; for example, vLLM does not support safetensors yet.

foamliu commented 1 year ago

vLLM may just have failed to load the weights; for example, vLLM does not support safetensors yet.

vLLM does not yet support safetensors, but this does not prevent us from converting the LLaMA model into the usual sharded format (e.g. pytorch_model-00001-of-00003.bin) and then loading it with vLLM.
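
A rough sketch of such a conversion (not from this thread; it assumes the checkpoint fits in memory when loaded through transformers, and both paths are hypothetical):

import transformers

src = "/path/to/llama-65b-safetensors"   # hypothetical input checkpoint (safetensors)
dst = "/path/to/llama-65b-bin"           # hypothetical output directory

# Load via transformers (which can read safetensors) and re-save as sharded
# pytorch_model-*.bin files, the format vLLM's weight loader expects here.
model = transformers.LlamaForCausalLM.from_pretrained(src, torch_dtype="auto")
model.save_pretrained(dst, safe_serialization=False, max_shard_size="10GB")

# Copy the tokenizer files alongside the weights.
tokenizer = transformers.LlamaTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)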

CtfGo commented 1 year ago

I have also encountered the same problem: the same prompt does not produce the same output, even with greedy sampling params. Is anyone working on this?

params        HF     vLLM
top_p         1.0    1.0
top_k         -1     -1
temperature   0.0    0.0
syskn commented 1 year ago

I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).

I compared the sampling process and could not find a difference. If greedy decoding doesn't match, could it be something in PagedAttention or the CUDA kernels?
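
One way to narrow this down (a sketch, not from the thread; the model path is hypothetical) is to compare greedy generations token by token: with temperature=0 in vLLM and do_sample=False in HF, any divergence has to come from the forward pass or kernels rather than from the sampling parameters.

import transformers
from vllm import LLM, SamplingParams

model_path = "/path/to/llama-7b"   # hypothetical path
prompt = "The capital of France is"

# Greedy decoding with vLLM (temperature=0 selects the argmax token each step).
llm = LLM(model=model_path, dtype="float16")
vllm_out = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=10))
vllm_ids = list(vllm_out[0].outputs[0].token_ids)

# Greedy decoding with HF transformers (do_sample=False).
tok = transformers.AutoTokenizer.from_pretrained(model_path)
hf_model = transformers.AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(hf_model.device)
generated = hf_model.generate(**inputs, do_sample=False, max_new_tokens=10)
hf_ids = generated[0][inputs.input_ids.shape[1]:].tolist()   # strip the prompt tokens

# If the two id sequences diverge, the first mismatching position shows where
# the two implementations start to disagree.
print("vLLM:", vllm_ids)
print("HF:  ", hf_ids)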

lw921014 commented 1 year ago

For LLaMA 65B, you'd better modify your tokenizer so that the BOS token id is 1 (here it comes out as 0, unlike LLaMA 13B).
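
A quick way to check this (a small sketch, using the path from the repro above) is to look at which BOS id the converted tokenizer actually emits; in the output pasted at the top, prompt_token_ids starts with 0 instead of the expected 1.

import transformers

tok = transformers.LlamaTokenizer.from_pretrained("/mnt/sdb/ly/models/hf_converted_llama/65B/")
print("bos_token_id:", tok.bos_token_id)                    # LLaMA normally uses 1
print("encoded:", tok.encode("The capital of France is"))   # the repro above shows this starting with 0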

luohao123 commented 1 year ago

@syskn

I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).

I compared the sampling process and could not find a difference. If greedy decoding doesn't match, could it be something in PagedAttention or the CUDA kernels?

See my issue here: https://github.com/vllm-project/vllm/issues/706

I set the same params, but the results are totally wrong; the bot seems much more stupid than the HF version...

oushu1zhangxiangxuan1 commented 11 months ago

Encountered the same problem

phamkhactu commented 11 months ago

Encountered the same problem

Yes, me too.

will-wiki commented 10 months ago

Encountered the same problem; see my issue.

pvtoan commented 10 months ago

Hi,

Could anyone please try to reproduce the answer from Llama-2-7B-Chat with the prompt "hello"?

In my case, I just get a weird answer: "@matthew-james.com".

I used exactly the same code as @foamliu, using vLLM with Llama-2-7B-Chat.

Thank you for your time and help!

ArlanCooper commented 8 months ago

I had the same problem. The outputs from vLLM and HF are inconsistent.

Yes, I found this problem too. With greedy decoding, although LLaMA 7B/13B/30B produce meaningful output, the results differ from HF transformers.

For example, the following are the scores of my evaluation with several benchmarks:

GSM8K
         LLaMA 7B   LLaMA 13B   LLaMA 30B
vLLM     9.40       15.01       24.94
HF       10.46      14.86       30.40

MMLU
         LLaMA 7B   LLaMA 13B   LLaMA 30B
vLLM     35.8       46.9        48.9
HF       34.1       46.7        57.8

Awesome!!

thehir0 commented 5 months ago

Encountered the same problem when using a model with dynamic RoPE scaling.

"rope_scaling": { "factor": 8.0, "type": "dynamic" },

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!