vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: TypeError: endswith first arg must be str or a tuple of str, not bytes #3447

Closed · mMrBun closed this issue 6 months ago

mMrBun commented 6 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 551.52
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 183
Model name:            13th Gen Intel(R) Core(TM) i7-13700F
Stepping:              1
CPU MHz:               2111.998
BogoMIPS:              4223.99
Virtualization:        VT-x
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             48K
L1i cache:             32K
L2 cache:              2048K
L3 cache:              30720K
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] torchvision               0.16.2                   pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

The model I am using is Qwen-7B-Chat, and the configuration file is as follows:

{
  "eos_token_id": [151643, 151645],
  "pad_token_id": 151643,
  "max_new_tokens": 1024,
  "do_sample": true,
  "top_k": 0,
  "top_p": 0.8,
  "transformers_version": "4.34.0"
}
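
For reference, these are the values that transformers' GenerationConfig loads from the checkpoint; a minimal sketch, with a placeholder path that is not from my setup:

from transformers import GenerationConfig

# Reads generation_config.json from the checkpoint directory (placeholder path).
generation_config = GenerationConfig.from_pretrained("/path/to/Qwen-7B-Chat")

print(generation_config.eos_token_id)    # [151643, 151645]
print(generation_config.max_new_tokens)  # 1024
print(generation_config.top_p)           # 0.8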

The Jinja template is as follows:

{%- if not add_generation_prompt is defined -%}
{%- set add_generation_prompt = false -%}
{%- endif -%}
{%- if messages[0]['role'] == 'system' -%}
{%- set loop_messages = messages[1:] -%}
{%- set system_message = messages[0]['content'] -%}
{%- else -%}
{%- set loop_messages = messages -%}
{%- set system_message = 'You are a helpful assistant.' -%}
{%- endif -%}
{{ '<|im_start|>' + 'system' + '\n' + system_message + '<|im_end|>' + '\n' }}
{%- for message in loop_messages -%}
{{ '<|im_start|>' + message['role'] + '\n' + message['content']}}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{ '<|im_end|>' + '\n' }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
{{ '<|im_start|>' + 'assistant' + '\n' }}
{%- endif -%}
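
For context, this template renders the usual ChatML layout; a minimal sketch of how it is applied through the tokenizer (the path, the template file name, and the example messages are placeholders, not from my setup):

from transformers import AutoTokenizer

# Qwen-7B-Chat ships a custom tokenizer, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained("/path/to/Qwen-7B-Chat", trust_remote_code=True)
with open("qwen_chatml_template.jinja") as f:  # the Jinja template shown above
    tokenizer.chat_template = f.read()

messages = [{"role": "user", "content": "What is the weather in Beijing?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What is the weather in Beijing?<|im_end|>
# <|im_start|>assistant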

When I make a request using the ReAct template for tool invocation, I encounter an error at the end of the response phase. The error occurs at line 959 of vllm/engine/llm_engine.py:

if not sampling_params.include_stop_str_in_output and stop_string:

The error is caused by the stop string "\bObserv" at this point, which leads to the following error message:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/imitater/service/chat.py", line 97, in create_chat_completion
    return await _create_local_chat_completion(request, model)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/imitater/service/chat.py", line 61, in _create_local_chat_completion
    result, prompt_tokens, completion_tokens = await model.function_call(**input_kwargs)
  File "/opt/anaconda3/envs/chat2bi/lib/python3.10/site-packages/imitater/model/chat_model.py", line 218, in function_call
    if generated_text.endswith(stop_token):
TypeError: endswith first arg must be str or a tuple of str, not bytes
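
The traceback ends in my own wrapper code, where generated_text.endswith(stop_token) is called with a bytes stop token; a minimal sketch of the kind of normalization that avoids the TypeError (function and variable names are illustrative, not the actual imitater code):

def normalize_stop_tokens(stop_tokens):
    # str.endswith only accepts a str or a tuple of str, so decode any bytes entries.
    return tuple(
        token.decode("utf-8", errors="replace") if isinstance(token, bytes) else token
        for token in stop_tokens
    )

generated_text = "Thought: I need to call a tool.\nObservation:"
stop_tokens = normalize_stop_tokens([b"Observation:", "<|im_end|>"])
print(generated_text.endswith(stop_tokens))  # True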

My code:

async def _generate(self, messages: List[Dict[str, str]], request_id: str, **gen_kwargs):
    # Render the chat template and tokenize the conversation.
    input_ids = self._tokenizer.apply_chat_template(
        conversation=messages, tokenize=True, add_generation_prompt=True
    )
    # Fall back to the model's generation_config for any parameter not passed in.
    sampling_params = SamplingParams(
        temperature=gen_kwargs.pop("temperature", None) or self._generation_config.temperature,
        top_p=gen_kwargs.pop("top_p", None) or self._generation_config.top_p,
        max_tokens=gen_kwargs.pop("max_tokens", None) or self._generation_config.max_new_tokens,
        stop_token_ids=self._generation_config.eos_token_id + gen_kwargs.pop("stop_token_ids", []),
    )
    # The engine's generate() returns an async generator of partial outputs.
    result_generator = self._engine.generate(
        prompt=None, sampling_params=sampling_params, request_id=request_id, prompt_token_ids=input_ids
    )
    return result_generator

......other code

final_result = None
generator = await self._generate(messages, request_id, **gen_kwargs)
async for result in generator:
    print(result.outputs[0].text)
    final_result = result
mMrBun commented 6 months ago

I'm sorry, this issue is not caused by vLLM, and it has already been resolved. I apologize again, and I will close this issue.