foamliu opened 1 year ago
I had the same problem. The outputs from vLLM and HF are inconsistent
Does this happen only on the 65B model? I am using 7B normally.
Is there any difference between the generation args of vLLM and HF? It seems that vLLM has some args that HF does not have. For reference, the HF generation args: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161
> Does this happen only on the 65B model? I am using 7B normally.
The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well.
> Is there any difference between the generation args of vLLM and HF? It seems that vLLM has some args that HF does not have.
The generation args are in the code above. Both should use greedy decoding, so the outputs shouldn't be too different (yet vLLM gives all '\n').
> I had the same problem. The outputs from vLLM and HF are inconsistent
Yes, I found this problem too. In the case of greedy decoding, although LLaMA 7B, 13B, and 30B can produce meaningful output, the results are different from those of HF transformers.
For example, the following are the scores of my evaluation with several benchmarks:
GSM8k

|  | LLaMA 7B | LLaMA 13B | LLaMA 30B |
|---|---|---|---|
| vLLM | 9.40 | 15.01 | 24.94 |
| HF | 10.46 | 14.86 | 30.40 |

MMLU

|  | LLaMA 7B | LLaMA 13B | LLaMA 30B |
|---|---|---|---|
| vLLM | 35.8 | 46.9 | 48.9 |
| HF | 34.1 | 46.7 | 57.8 |
The generation params can heavily affect final model performance.
So is it reliable to evaluate LLaMA results using your scripts? That is really weird...
> So is it reliable to evaluate LLaMA results using your scripts? That is really weird...
The same result can be stably reproduced on my V100 server.
vLLM just failed to load the weights; for example, vLLM has no support for safetensors yet.
> vLLM just failed to load the weights; for example, vLLM has no support for safetensors yet.
vLLM does not yet support safetensors, but this does not prevent us from converting the LLaMA weights into the pytorch_model-00001-of-00003.bin shard format and then loading them with vLLM.
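A minimal sketch of that conversion via HF transformers (paths are placeholders; the shard count depends on `max_shard_size`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Re-export a safetensors checkpoint as .bin shards (placeholder paths).
src = "/path/to/llama-safetensors"
dst = "/path/to/llama-bin"

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
model.save_pretrained(dst, safe_serialization=False)  # writes pytorch_model-0000X-of-0000Y.bin + index

tokenizer = AutoTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)
```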
I have also encountered the same problem: the same prompt does not produce the same output, even with greedy sampling params. Is anyone working on resolving this?
| params | HF | vLLM |
|---|---|---|
| top_p | 1.0 | 1.0 |
| top_k | -1 | -1 |
| temperature | 0.0 | 0.0 |
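For reference, a sketch of how those settings map onto each API; note that in HF transformers greedy decoding is normally requested with `do_sample=False`, since top_p/top_k/temperature only apply when sampling is enabled:

```python
from transformers import GenerationConfig
from vllm import SamplingParams

# vLLM: greedy decoding with the params from the table above.
greedy_vllm = SamplingParams(temperature=0.0, top_p=1.0, top_k=-1, max_tokens=256)

# HF transformers: greedy decoding is selected with do_sample=False;
# the sampling knobs are ignored unless do_sample=True.
greedy_hf = GenerationConfig(do_sample=False, max_new_tokens=256)
```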
I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).
I compared the sampling process and could not find a difference. If greedy decoding doesn't match, could it be something in PagedAttention or the CUDA kernels?
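One way to narrow this down is to compare the greedy token ids from both frameworks directly; a rough sketch (placeholder model path and prompt, with `gpu_memory_utilization` lowered so both copies fit in one process):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/path/to/llama-7b"  # placeholder path
PROMPT = "The capital of France is"

# Greedy continuation from vLLM (leave room on the GPU for the HF copy below).
llm = LLM(model=MODEL, gpu_memory_utilization=0.45)
vllm_out = llm.generate([PROMPT], SamplingParams(temperature=0.0, max_tokens=64))
vllm_ids = list(vllm_out[0].outputs[0].token_ids)

# Greedy continuation from HF transformers.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
hf_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)[0]
hf_ids = hf_ids[inputs["input_ids"].shape[1]:].tolist()

# Report the first generated position where the two continuations disagree.
for i, (a, b) in enumerate(zip(vllm_ids, hf_ids)):
    if a != b:
        print(f"first divergence at token {i}: vllm={a}, hf={b}")
        break
else:
    print("greedy continuations match over the compared length")
```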
For LLaMA 65B, you'd better modify your tokenizer's BOS to 1 (it is 0 for LLaMA 13B).
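A quick way to check what the tokenizer is actually using (a sketch; the path is a placeholder):

```python
from transformers import AutoTokenizer

# Inspect the BOS token the tokenizer will prepend (placeholder path).
tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-65b")
print(tokenizer.bos_token, tokenizer.bos_token_id)
```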
@syskn
> I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).
> I compared the sampling process and could not find a difference. If greedy decoding doesn't match, could it be something in PagedAttention or the CUDA kernels?
See my issue here: https://github.com/vllm-project/vllm/issues/706
I set the same params, but the results are totally wrong; the bot seems much more stupid than the HF version...
Encountered the same problem
> Encountered the same problem
Yes, me too.
Hi,
Could anyone please try to reproduce the answer from Llama-2-7B-Chat with the prompt "hello"?
Because, in my case, I just get a weird answer: "@matthew-james.com".
I used exactly the same code as @foamliu when running vLLM with Llama-2-7B-Chat.
Thank you for your time and help!
> Yes, I found this problem too. In the case of greedy decoding, although LLaMA 7B, 13B, and 30B can produce meaningful output, the results are different from those of HF transformers. (See the GSM8k and MMLU tables above.)
awesome!!
Encountered the same problem when using a model with dynamic RoPE scaling:
"rope_scaling": { "factor": 8.0, "type": "dynamic" },
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well. Below is the reproduce code:
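A minimal sketch of such a run with vLLM, using greedy sampling; the model path, `tensor_parallel_size`, and prompt are placeholders, not the original script:

```python
from vllm import LLM, SamplingParams

# Sketch of a greedy-decoding run with vLLM (placeholder model path and prompt).
llm = LLM(model="/path/to/llama-65b", tensor_parallel_size=8)
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)
```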
And HuggingFace transformers works as normal:
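The corresponding greedy run with HF transformers (again a sketch with placeholder paths and prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the equivalent greedy run with HF transformers (placeholder path).
tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-65b")
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/llama-65b", torch_dtype=torch.float16, device_map="auto"
)
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```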