I'm seeing a similar issue: I can't reproduce the HF inference results with all default parameters except top_k=30, top_p=0.75, max_tokens=1024.
I have the same problem with Baichuan-13B.
Hi @paulcx @normster
Do you have any more information?
Nope, I cannot reproduce the results compared to HF when running greedy decoding.
Same problem with yi-34b-chat (3 quantized models: the official yi-34b-chat-4bits, plus the AWQ and GPTQ versions from TheBloke).
sampling params: vLLM default settings
system: "You are a helpful assistant."
prompt: "1+1=?不用解释,直接给出答案:" ("1+1=? No explanation, just give the answer:")
transformers: "1 + 1 = 2"
vllm: "1 + 1 = 2 \n\n这个答案是基于基本的数学运算,将两个数字相加。 \n\n如果你有其他的问题,或者需要帮助理解其他问题,请随时告诉我! \n\n如果你是准备考试或者学习新知识,我会尽力提供帮助。 \n\n祝你学习顺利,如果需要更多帮助,请随时提问。\n\n \n\n如果你是准备考试或者学习新知识,我会尽力提供帮助" (roughly: "This answer is based on basic arithmetic, adding the two numbers. If you have other questions or need help understanding something else, feel free to tell me! If you are preparing for an exam or learning new material, I will do my best to help. Good luck with your studies; ask anytime if you need more help.", followed by a repeat of the exam/learning sentence)
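One thing to keep in mind: vLLM's default SamplingParams (temperature=1.0, top_p=1.0, top_k off) are not the same as what transformers picks up from the model's own generation_config, so "default settings" on the two sides are different requests. A rough sketch of mirroring the HF config in vLLM (the checkpoint name and max_tokens here are just placeholders):

```python
from transformers import GenerationConfig
from vllm import LLM, SamplingParams

model_id = "01-ai/Yi-34B-Chat-4bits"  # placeholder; use the checkpoint you are actually testing

# The sampling settings transformers would pick up from the model repo by default.
gen_cfg = GenerationConfig.from_pretrained(model_id)

# Mirror them in vLLM instead of relying on vLLM's own defaults.
sampling = SamplingParams(
    temperature=gen_cfg.temperature,
    top_p=gen_cfg.top_p,
    top_k=gen_cfg.top_k if gen_cfg.top_k else -1,  # vLLM disables top-k with -1
    repetition_penalty=gen_cfg.repetition_penalty,
    max_tokens=256,
)

llm = LLM(model=model_id, trust_remote_code=True)
# NOTE: for a chat model, the chat template / system prompt also has to match on both sides.
outputs = llm.generate(["1+1=?不用解释,直接给出答案:"], sampling)
print(outputs[0].outputs[0].text)
```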
Still not resolved from what I'm seeing
I observed a discrepancy between Hugging Face and vLLM. I'm currently using version 0.3.0 due to NCCL issues (which I'm working on resolving). In my tests with the Mistral and Mixtral 8x7B models, I found discrepancies when using the bfloat16 data type.
While both vLLM and Hugging Face results seem reasonable, shouldn't we be getting identical outputs with the same settings (no sampling, topk=1, etc.)? Interestingly, switching the data type to float16 produces identical results in both cases.
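In case it helps anyone reproduce this, the only thing I flip between the two runs is the dtype; the checkpoint name below is just an example (I saw the same thing with Mistral and Mixtral):

```python
import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example checkpoint; same pattern for Mixtral

# HF side: the dtype that shows the drift
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# vLLM side: pin the same dtype explicitly instead of leaving it on "auto";
# switching both of these to float16 makes the greedy outputs line up again
llm = LLM(model=model_id, dtype="bfloat16")
```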
This issue has been around for half a year without being resolved, or even having its root cause identified, which is quite frustrating.
Strict Consistency: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to test_models.py and test_big_models.py for the models that have passed this test.
How should I set the API parameters to achieve the same effect as the vllm_model.generate_greedy method used in test_models.py?
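As far as I can tell, generate_greedy boils down to greedy decoding, so against the OpenAI-compatible server the closest equivalent should be temperature=0. A sketch (the launch command, model name, and prompt below are only examples):

```python
from openai import OpenAI

# Assuming a server started with something like:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the model the server was launched with
    prompt="The capital of France is",
    temperature=0.0,  # temperature=0 makes decoding effectively greedy, like generate_greedy
    max_tokens=64,
)
print(resp.choices[0].text)
```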
Same problem when I use Qwen1.5; the outputs are very different between HuggingFace Transformers and vLLM.
Please check out https://docs.vllm.ai/en/latest/models/supported_models.html#model-support-policy .
How is this supposed to help?
vLLM provides invaluable improvements over HF, but I've noticed that the model outputs are of lower quality most of the time, to the point that it becomes unusable.
Are we doing something wrong? If not, is there any plan to look into this?
I'm getting inconsistent results between HF and vllm with llama2-7b running greedy decoding:
HF version:
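(a minimal sketch of the HF side; the checkpoint meta-llama/Llama-2-7b-hf and the prompt are placeholders rather than the exact script)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
prompt = "The capital of France is"  # placeholder prompt

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
print(tok.decode(out[0], skip_special_tokens=True))
```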
which yields:
vllm version:
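(again a sketch with the same placeholder prompt; temperature=0 gives greedy decoding in vLLM)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=64)  # temperature=0 => greedy

out = llm.generate(["The capital of France is"], params)
print(out[0].outputs[0].text)
```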
which yields: