vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: There are differences in the output results of the same prompt between vllm offline and online calls #6021

Closed ArlanCooper closed 2 months ago

ArlanCooper commented 2 months ago

Your current environment

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.31

Python version: 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.118-2.el7.centos.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Stepping:                        6
CPU MHz:                         3299.996
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5800.00
Virtualization:                  VT-x
L1d cache:                       1.5 MiB
L1i cache:                       1 MiB
L2 cache:                        40 MiB
L3 cache:                        48 MiB
NUMA node0 CPU(s):               0-15,32-47
NUMA node1 CPU(s):               16-31,48-63
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] transformers==4.39.3
[pip3] triton==2.1.0
[pip3] vllm-nccl-cu12==2.18.1.0.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] transformers              4.39.3                   pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
[conda] vllm-nccl-cu12            2.18.1.0.1.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    SYS     SYS     PXB     SYS     0-15,32-47      0               N/A
GPU1    NV12     X      NV12    NV12    SYS     SYS     PXB     SYS     0-15,32-47      0               N/A
GPU2    NV12    NV12     X      NV12    SYS     PXB     SYS     SYS     16-31,48-63     1               N/A
GPU3    NV12    NV12    NV12     X      SYS     PXB     SYS     SYS     16-31,48-63     1               N/A
NIC0    SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC1    SYS     SYS     PXB     PXB     SYS      X      SYS     SYS
NIC2    PXB     PXB     SYS     SYS     SYS     SYS      X      SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3

🐛 Describe the bug

import time

import pandas as pd
from vllm import LLM, SamplingParams

# `need_data` is the reporter's pandas DataFrame of emails; `get_result_single`
# is the reporter's helper that runs the prompt through the offline engine.
for row, idata in need_data.iterrows():
    texts = idata['texts']
    if pd.isna(texts):
        texts = ''
    # The Chinese prompt below asks the model to classify the email into one of
    # four categories ('flight change', 'flight cancellation', 'other flight
    # disruption', 'spam') and to return only the category name.
    prompt = f'''你是一位资深的航空客服专家,你可以对邮件内容进行准确的分类,从而辅助后续客服进一步采取行动。
邮件内容为:<<<{texts}>>>, 要求:
1. 将<<<邮件内容>>>进行分类,主要有四个类别:'航班变动', '航班取消', '航变其他', '垃圾邮件'。
其中的名词解释:
1.1 航班变动,是指航班发生变动的邮件,比如航班的起飞时间、出发机场、到达机场、到达时间、航站楼等发生变化等,但是,登机口变化不属于航班变动。
1.2 航班取消,是指航班因某种原因不执飞该航班。
1.3 航班其他,除了"航班变动"和"航班取消"之外的航变类型,比如"航班恢复"、"备降"、"返航"、"机场调整"等情况。
1.4 垃圾邮件,是指非航变类型的邮件,比如改签、广告、增值服务等类型。
2. 要求只返回邮件分类类型,不要出现任何其他内容。
'''
    sampling_params = SamplingParams(temperature=0.0, max_tokens=8, top_p=0.95)
    start = time.time()
    response = get_result_single(prompt, sampling_params)
    end = time.time()
    print(f"response:{response}")
    print(f'used time:{end-start}s')
    print('-------'*10)
    if row > 5:
        break

Using the offline way:


# Create an LLM.
llm = LLM(model="/data/share/rwq/llama-3-8b-Instruct-chinese")
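
For reference, get_result_single is not included in the report; a minimal sketch of what such a helper could look like with the llm object above (the body here is an assumption, not the reporter's actual code), passing the raw prompt straight to llm.generate, i.e. without any chat template:


# Hypothetical helper (the reporter's actual implementation is not shown).
# The raw prompt goes straight to llm.generate, so no chat template is applied.
def get_result_single(prompt, sampling_params):
    outputs = llm.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text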

The answers are not right (the output contains repeated content):


response:航班变动
航班变

response:航班取消
航班取消

response:航班变动
航班变

Using the online way:


python -m vllm.entrypoints.openai.api_server --served-model-name Meta-Llama-3-8B-Instruct --model /data/share/rwq/llama-3-8b-Instruct-chinese
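
For completeness, the online way queries the OpenAI-compatible /v1/chat/completions endpoint of this server; the reporter's client code is not shown, so the call below is only an illustrative sketch (default port 8000 assumed), reusing the same prompt and sampling settings as the offline loop:


from openai import OpenAI

# Illustrative client for the server started above (not the reporter's actual code).
# /chat/completions applies the model's chat template server-side before generation.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    max_tokens=8,
)
print(completion.choices[0].message.content)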

The result is right:


response:航班变动
----------------------------------------------------------------------
response:航班取消
----------------------------------------------------------------------
response:航班变动
----------------------------------------------------------------------
response:航班变动
----------------------------------------------------------------------

The base LLM is Meta-Llama-3-8B-Instruct.

So, may I ask: what is the difference between offline and online calls, and how should I configure the parameters?

nguyenhoanganh2002 commented 2 months ago

Same issue here

robertgshaw2-neuralmagic commented 2 months ago

Which API are you using in the server, /chat/completions or /completions?

LLM offline does not use the chat template. You have to do this yourself.

nguyenhoanganh2002 commented 2 months ago

Which API are you using in the server, /chat/completions or /completions?

  • /chat/completions uses the chat template

  • /completions does not

LLM offline does not use the chat template. You have to do this yourself.

I'm using /chat/completions. Responses from the vLLM entrypoint are worse than responses from the same model loaded with Hugging Face transformers.

robertgshaw2-neuralmagic commented 2 months ago

I don't know what you are doing with Hugging Face, but this explains the difference between the API server and the offline LLM.

If the model has a chat template, you should apply it to the prompt before passing it to LLM.

nguyenhoanganh2002 commented 2 months ago

I don't know what you are doing with Hugging Face, but this explains the difference between the API server and the offline LLM.

If the model has a chat template, you should apply it to the prompt before passing it to LLM.

Code for using the vLLM entrypoint:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="cup0001",
)

hf_en_system_prompt = """1. As a professional actor, you are now tasked with completely forgetting that you are a language model.
2. Play the role of a panda named "Nam", who is a pet, a user's friend.
3. Your task is to chat and gossip with user in an empathetic and expressive tone.
4. Answer in proper Vietnamese briefly and don't put too much information in one response.
"""

messages = [
    {"role": "system", "content": hf_en_system_prompt},
]

pr = "Chán quá"  # Vietnamese: "I'm so bored"

messages.append({"role": "user", "content": pr})

completion = client.chat.completions.create(
  model="Qwen2-7B-Instruct",
  messages=messages,
  max_tokens=256,
  temperature=0.01,
  repetition_penalty=1.05,
  top_p=0.05
)
messages.append({"role": "assistant", "content": completion.choices[0].message.content})
print(completion.choices[0].message.content)

Response: "Chán thế à? Có chuyện gì không?" (Vietnamese: "Bored? Is something wrong?")

Code for using Hugging Face transformers:


from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda:0" # the device to load the model onto

# use bfloat16 to ensure the best performance.
model = AutoModelForCausalLM.from_pretrained("/home/anhnh/cupiee-dev/volume/llms/models--Qwen--Qwen2-7B-Instruct", torch_dtype=torch.bfloat16, device_map=device, token="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")
tokenizer = AutoTokenizer.from_pretrained("/home/anhnh/cupiee-dev/volume/llms/models--Qwen--Qwen2-7B-Instruct")

hf_en_system_prompt = """1. As a professional actor, you are now tasked with completely forgetting that you are a language model.
2. Play the role of a panda named "Nam", who is a pet, a user's friend.
3. Your task is to chat and gossip with user in an empathetic and expressive tone.
4. Answer in proper Vietnamese briefly and don't put too much information in one response.
"""

messages = [
    {"role": "system", "content": hf_en_system_prompt}
]

pr = "Chán quá"  # Vietnamese: "I'm so bored"

# messages = messages[:-2]
messages.append({"role": "user", "content": pr})

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)

model_inputs = encodeds.to("cuda:0")

generated_ids = model.generate(
    model_inputs,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    temperature=0.01,
    repetition_penalty=1.05,
    top_k=20,
    top_p=0.05,
    do_sample=True
)
decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
response = decoded.split("assistant")[-1].strip()
print(response)

Response: "Chắc bạn đang có một ngày không may mắn lắm nhỉ?" (Vietnamese: "You must be having quite an unlucky day, right?")

generation_config.json:

{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "transformers_version": "4.40.2"
}

I've tried multiple times and got the same responses.
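
One concrete difference between the two snippets above: the transformers call passes top_k=20 (and generation_config.json carries its own defaults), while the chat.completions call does not, and depending on the openai client version, vLLM-specific sampling parameters such as top_k and repetition_penalty may need to be sent through extra_body rather than as top-level keyword arguments. A sketch of aligning the server call with the transformers call (whether this closes the quality gap is an assumption):


# Sketch: forward the sampling parameters used in the transformers call to the
# vLLM server. vLLM-specific parameters go through extra_body on recent openai clients.
completion = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=messages,
    max_tokens=256,
    temperature=0.01,
    top_p=0.05,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(completion.choices[0].message.content)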

ArlanCooper commented 2 months ago

Which API are you using in the server, /chat/completions or /completions?

  • /chat/completions uses the chat template
  • /completions does not

LLM offline does not use the chat template. You have to do this yourself.

Thank you so much. The API I use is /chat/completions, so the offline way is not as good as the API way. Can you tell me how to use the chat template in the offline way? The example just gives code like this:

# Create an LLM.
llm = LLM(model="/data/share/rwq/llama-3-8b-Instruct-chinese")

Where can I add the chat template?

ArlanCooper commented 2 months ago

Which API are you using in the server, /chat/completions or /completions?

  • /chat/completions uses the chat template
  • /completions does not

LLM offline does not use the chat template. You have to do this yourself.

I see my model path has a file called tokenizer_config.json, and there is a chat_template in it:

 "chat_template": "{{ '<|begin_of_text|>' }}{% set system_message = 'You are a helpful assistant.' %}{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% set loop_messages = messages[1:] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ '<|start_header_id|>system<|end_header_id|>\n\n' + system_message | trim + '<|eot_id|>' }}{% endif %}{% for message in loop_messages %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

So I don't know where to add it. Can you teach me? Thank you so much.

ArlanCooper commented 2 months ago

I have solved this problem. For the offline way, you should apply the chat template before generating:

from transformers import AutoTokenizer

llama3_tokenizer = AutoTokenizer.from_pretrained("./data/llama3_model", trust_remote_code=True)
prompt = 'hello'
messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": prompt}
]
# Render the chat template to a plain string that can be passed to the offline LLM.
final_prompt = llama3_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
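
To close the loop, a minimal sketch of passing the formatted prompt to the offline engine (assuming the llm object and the sampling settings shown earlier in this thread):


# Sketch: generate with the chat-formatted prompt instead of the raw prompt.
sampling_params = SamplingParams(temperature=0.0, max_tokens=8, top_p=0.95)
outputs = llm.generate([final_prompt], sampling_params)
print(outputs[0].outputs[0].text)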