gpustack / gpustack

Manage GPU clusters for running LLMs
https://gpustack.ai
Apache License 2.0

Inference server exited with code 0 #283

Closed · pengjiang80 closed 2 weeks ago

pengjiang80 commented 2 weeks ago

Describe the bug

Environment

Steps to reproduce

  1. Install GPUStack and deploy a model (Llama 3.1 or Qwen2).
  2. Go to the playground and chat with the model for 5-7 rounds (prompt: "tell me a story"); a scripted version of this repro is sketched after the server log below.
  3. After a few rounds the model stops responding. The model status shows an error with the message "inference server exited with code 0".
  4. The GPUStack log file contains the following error message:

2024-09-15T23:30:40+08:00 - gpustack.api.middlewares - ERROR - Error processing streaming response: unhandled errors in a TaskGroup (1 sub-exception)

  5. In the inference server log, the error is as follows:
    [1726414035] warming up the model with an empty run
    INFO [                          main] model loaded | tid="140308454092800" ts=1726414035
    INFO [                          main] initializing server | hostname="0.0.0.0" n_threads_http="55" port="40406" tid="140308454092800" ts=1726414035
    INFO [                          init] initializing slots | n_slots=4 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=0 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=1 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=2 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          init] new slot | id_slot=3 n_ctx_slot=2048 tid="140308454092800" ts=1726414035
    INFO [                          main] server initialized | tid="140308454092800" ts=1726414035
    INFO [                          main] chat template | builtin=true example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" template="{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n    {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n    {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n    {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n            {%- for arg_name, arg_val in tool_call.arguments | items %}\n                {{- arg_name + '=\"' + arg_val + '\"' }}\n                {%- if not loop.last %}\n                    {{- \", \" }}\n                {%- endif %}\n                {%- endfor %}\n            {{- \")\" }}\n        {%- else  %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n            {{- '\"parameters\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- \"}\" }}\n        {%- endif %}\n        {%- if builtin_tools is defined %}\n            {#- This means we're in ipython mode #}\n            {{- \"<|eom_id|>\" }}\n        {%- else %}\n            {{- \"<|eot_id|>\" }}\n        {%- endif %}\n    {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n        {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- \"<|eot_id|>\" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n" tid="140308454092800" ts=1726414035
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414035
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307656077312" ts=1726414123
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=0 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414123
    INFO [                       release] slot released | id_slot=0 id_task=0 n_past=250 tid="140308454092800" truncated=false ts=1726414128
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414128
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41192 status=200 tid="140307656077312" ts=1726414128
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414128
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307647684608" ts=1726414135
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=236 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414135
    INFO [                       release] slot released | id_slot=0 id_task=236 n_past=542 tid="140308454092800" truncated=false ts=1726414141
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414141
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41194 status=200 tid="140307647684608" ts=1726414141
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414141
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307639291904" ts=1726414166
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=514 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414166
    INFO [                       release] slot released | id_slot=0 id_task=514 n_past=926 tid="140308454092800" truncated=false ts=1726414175
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414175
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41196 status=200 tid="140307639291904" ts=1726414175
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414175
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307630899200" ts=1726414182
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=884 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414182
    INFO [                       release] slot released | id_slot=0 id_task=884 n_past=1354 tid="140308454092800" truncated=false ts=1726414193
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414193
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41198 status=200 tid="140307630899200" ts=1726414193
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414193
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307622506496" ts=1726414198
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=1298 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414198
    INFO [                       release] slot released | id_slot=0 id_task=1298 n_past=1810 tid="140308454092800" truncated=false ts=1726414211
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414211
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41200 status=200 tid="140307622506496" ts=1726414211
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414211
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307614113792" ts=1726414218
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=1740 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414218
    INFO [                  update_slots] slot context shift | id_slot=0 id_task=1740 n_cache_tokens=0 n_ctx=8192 n_discard=1023 n_keep=1 n_left=2046 n_past=2047 n_system_tokens=0 tid="140308454092800" ts=1726414226
    INFO [                       release] slot released | id_slot=0 id_task=1740 n_past=1228 tid="140308454092800" truncated=true ts=1726414231
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414231
    INFO [            log_server_request] request | method="POST" params={} path="/v1/chat/completions" remote_addr="10.18.172.121" remote_port=41202 status=200 tid="140307614113792" ts=1726414231
    INFO [                  update_slots] all slots are idle | tid="140308454092800" ts=1726414231
    INFO [  oaicompat_completion_request] OAI request | params={"max_tokens":1024,"messages":"[...]","model":"llama3.1","stream":true,"temperature":1,"top_p":1} tid="140307605721088" ts=1726414239
    INFO [         launch_slot_with_task] slot is processing task | id_slot=0 id_task=2167 max_tokens_per_second="N/A" tid="140308454092800" ts=1726414239
    /home/runner/work/llama-box/llama-box/llama-box/server.cpp:2368: GGML_ASSERT(batch.n_tokens > 0) failed
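
For anyone who prefers to script the repro instead of clicking through the playground, a rough sketch along these lines exercises the same path (the endpoint path, API key, and model name below are placeholders for this sketch, not GPUStack defaults; adjust them for your deployment):

    # Sketch of a scripted repro. Assumptions: the OpenAI-compatible base URL,
    # API key, and model name below are placeholders for this deployment.
    import json
    import requests

    BASE_URL = "http://localhost/v1"   # placeholder; use your GPUStack API base URL
    API_KEY = "<gpustack-api-key>"     # placeholder; create one in the GPUStack UI
    MODEL = "llama3.1"                 # name of the deployed model

    messages = []
    for round_no in range(1, 9):       # 5-7 rounds were enough to hit the error
        messages.append({"role": "user", "content": "tell me a story"})
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": MODEL, "messages": messages, "max_tokens": 1024,
                  "stream": True, "temperature": 1, "top_p": 1},
            stream=True,
            timeout=600,
        )
        resp.raise_for_status()
        reply = []
        for raw in resp.iter_lines():
            # The stream is server-sent events: "data: {...}" lines ending with [DONE].
            if not raw or not raw.startswith(b"data: ") or raw == b"data: [DONE]":
                continue
            choices = json.loads(raw[len(b"data: "):]).get("choices") or []
            if choices:
                reply.append(choices[0].get("delta", {}).get("content") or "")
        print(f"round {round_no}: received {len(''.join(reply))} characters")
        # Feed the reply back so the prompt keeps growing until the context shifts.
        messages.append({"role": "assistant", "content": "".join(reply)})
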
thxCode commented 2 weeks ago

Fixed in https://github.com/gpustack/llama-box/commit/a8d562df9264d79182bd282643fa1e056ae09736; we need to bump llama-box. cc @gitlawr @aiwantaozi.

The root cause is that the sliding-window (context shift) handling for long prompt content assigns a wrong state value.
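
For reference, the numbers in the `slot context shift` log line match the usual llama.cpp-style sliding-window arithmetic. The sketch below is an illustration under that assumption (not the actual llama-box code) of how the shift is computed and why an empty decode batch trips the assert:

    # Sliding-window arithmetic from the "slot context shift" log line above
    # (assumed llama.cpp-style behaviour; not the llama-box source).
    n_ctx_slot = 2048   # per-slot context: n_ctx=8192 split across n_slots=4
    n_keep = 1          # tokens always kept at the start of the window
    n_past = 2047       # tokens currently held in the slot's KV cache

    n_left = n_past - n_keep          # 2046 -> matches n_left in the log
    n_discard = n_left // 2           # 1023 -> matches n_discard in the log
    n_remaining = n_past - n_discard  # 1024 tokens stay in the cache after the shift

    # After the shift the slot must still have at least one token to decode.
    # If the slot's state is mis-set during the shift, the next batch can end
    # up empty, which GGML_ASSERT(batch.n_tokens > 0) in server.cpp rejects.
    assert n_remaining > 0
    print(n_left, n_discard, n_remaining)  # 2046 1023 1024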

gitlawr commented 2 weeks ago

Verified on main