gpustack / gpustack

Manage GPU clusters for running LLMs
https://gpustack.ai
Apache License 2.0

Inference server exited with code 0 #283

Closed · pengjiang80 closed 2 weeks ago

pengjiang80 commented 2 weeks ago

Describe the bug

Environment

Steps to reproduce

  1. Install GPUStack and deploy a model (Llama 3.1 or Qwen2).
  2. Go to the playground and chat with the model for 5-7 rounds (prompt: "tell me a story"); a scripted version of this repro is sketched after the server log below.
  3. After a few rounds the model stops responding. The model status shows an error with the message "inference server exited with code 0".
  4. The GPUStack log file contains the following error message:

2024-09-15T23:30:40+08:00 - gpustack.api.middlewares - ERROR - Error processing streaming response: unhandled errors in a TaskGroup (1 sub-exception)

  5. In the inference server log, the error is as follows:
    [1726414035] warming up the model with an empty run
    INFO [                          main] model loaded | tid="140308454092800" ts=1726414035
    INFO [                          main] initializing server | hostname="0.0.0.0" n_threads_http="55" port="40406" tid="140308454092800" ts=1726414035
    INFO [                          init] initializing slots | n_slots=4 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=0 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=1 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=2 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=3 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          main] server initialized | tid="140308454092800" ts=1726414035
    INFO [                          main] chat template | builtin=true example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" template="{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n    {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n    {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n    {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n            {%- for arg_name, arg_val in tool_call.arguments | items %}\n                {{- arg_name + '=\"' + arg_val + '\"' }}\n                {%- if not loop.last %}\n                    {{- \", \" }}\n                {%- endif %}\n                {%- endfor %}\n            {{- \")\" }}\n        {%- else  %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n            {{- '\"parameters\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- \"}\" }}\n        {%- endif %}\n        {%- if builtin_tools is defined %}\n            {#- This means we're in ipython mode #}\n            {{- \"<|eom_id|>\" }}\n        {%- else %}\n            {{- \"<|eot_id|>\" }}\n        {%- endif %}\n    {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n        {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- \"<|eot_id|>\" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" tid="140308454092800" ts=1726414035
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414035
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307656077312" ts=1726414123
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=0 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414123
    INFO [                       release] slot released | id_slot=0 id_task=0 n_past=250 tid="140308454092800" truncated=false ts=1726414128
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414128
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41192 status=200 tid="140307656077312" ts=1726414128
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414128
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307647684608" ts=1726414135
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=236 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414135
    INFO [                       release] slot released | id_slot=0 id_task=236 n_past=542 tid="140308454092800" truncated=false ts=1726414141
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414141
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41194 status=200 tid="140307647684608" ts=1726414141
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414141
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307639291904" ts=1726414166
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=514 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414166
    INFO [                       release] slot released | id_slot=0 id_task=514 n_past=926 tid="140308454092800" truncated=false ts=1726414175
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414175
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41196 status=200 tid="140307639291904" ts=1726414175
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414175
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307630899200" ts=1726414182
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=884 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414182
    INFO [                       release] slot released | id_slot=0 id_task=884 n_past=1354 tid="140308454092800" truncated=false ts=1726414193
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414193
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41198 status=200 tid="140307630899200" ts=1726414193
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414193
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307622506496" ts=1726414198
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=1298 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414198
    INFO [                       release] slot released | id_slot=0 id_task=1298 n_past=1810 tid="140308454092800" truncated=false ts=1726414211
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414211
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41200 status=200 tid="140307622506496" ts=1726414211
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414211
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307614113792" ts=1726414218
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=1740 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414218
    INFO [                  update_slots] slot context shift | id_slot=0 id_task=1740 n_cache_tokens=0 n_ctx=8192 n_discard=1023 n_keep=1 n_left=2046 n_past=2047 n_system_tokens=0 tid="140308454092800" ts=1726414226
    INFO [                       release] slot released | id_slot=0 id_task=1740 n_past=1228 tid="140308454092800" truncated=true ts=1726414231
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414231
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41202 status=200 tid="140307614113792" ts=1726414231
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414231
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307605721088" ts=1726414239
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=2167 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414239
    /home/runner/work/llama-box/llama-box/llama-box/server.cpp:2368: GGML_ASSERT(batch.n_tokens > 0) failed
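
For anyone who prefers to script the repro instead of clicking through the playground, a rough sketch along these lines exercises the same path (the endpoint path, API key, and model name below are placeholders for this sketch, not GPUStack defaults; adjust them for your deployment):

    # Sketch of a scripted repro. Assumptions: the OpenAI-compatible base URL,
    # API key, and model name below are placeholders for this deployment.
    import json
    import requests

    BASE_URL = "http://localhost/v1"   # placeholder; use your GPUStack API base URL
    API_KEY = "<gpustack-api-key>"     # placeholder; create one in the GPUStack UI
    MODEL = "llama3.1"                 # name of the deployed model

    messages = []
    for round_no in range(1, 9):       # 5-7 rounds were enough to hit the error
        messages.append({"role": "user", "content": "tell me a story"})
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": MODEL, "messages": messages, "max_tokens": 1024,
                  "stream": True, "temperature": 1, "top_p": 1},
            stream=True,
            timeout=600,
        )
        resp.raise_for_status()
        reply = []
        for raw in resp.iter_lines():
            # The stream is server-sent events: "data: {...}" lines ending with [DONE].
            if not raw or not raw.startswith(b"data: ") or raw == b"data: [DONE]":
                continue
            choices = json.loads(raw[len(b"data: "):]).get("choices") or []
            if choices:
                reply.append(choices[0].get("delta", {}).get("content") or "")
        print(f"round {round_no}: received {len(''.join(reply))} characters")
        # Feed the reply back so the prompt keeps growing until the context shifts.
        messages.append({"role": "assistant", "content": "".join(reply)})
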
thxCode commented 2 weeks ago

Fixed in https://github.com/gpustack/llama-box/commit/a8d562df9264d79182bd282643fa1e056ae09736; we need to bump llama-box. cc @gitlawr @aiwantaozi.

The root cause is that the sliding-window (context shift) handling for long prompt content assigns a wrong state value.
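
For reference, the numbers in the `slot context shift` log line match the usual llama.cpp-style sliding-window arithmetic. The sketch below is an illustration under that assumption (not the actual llama-box code) of how the shift is computed and why an empty decode batch trips the assert:

    # Sliding-window arithmetic from the "slot context shift" log line above
    # (assumed llama.cpp-style behaviour; not the llama-box source).
    n_ctx_slot = 2048   # per-slot context: n_ctx=8192 split across n_slots=4
    n_keep = 1          # tokens always kept at the start of the window
    n_past = 2047       # tokens currently held in the slot's KV cache

    n_left = n_past - n_keep          # 2046 -> matches n_left in the log
    n_discard = n_left // 2           # 1023 -> matches n_discard in the log
    n_remaining = n_past - n_discard  # 1024 tokens stay in the cache after the shift

    # After the shift the slot must still have at least one token to decode.
    # If the slot's state is mis-set during the shift, the next batch can end
    # up empty, which GGML_ASSERT(batch.n_tokens > 0) in server.cpp rejects.
    assert n_remaining > 0
    print(n_left, n_discard, n_remaining)  # 2046 1023 1024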

gitlawr commented 2 weeks ago

Verified on main