Describe the bug
Inference fails after prompt evaluation with the llama-cpp backend, with the following error:
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute at /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2694
cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
/home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
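For context, the failing call is inside ggml's CUDA graph execution path (ggml_backend_cuda_graph_compute). Below is a hedged diagnostic sketch, assuming this wheel honors the GGML_CUDA_DISABLE_GRAPHS environment variable that recent llama.cpp builds check before capturing CUDA graphs; whether this exact build supports it is not confirmed.

import os

# Hedged assumption: recent ggml-cuda builds skip CUDA graph capture when this
# variable is set. It is read when the backend runs graph compute, so set it
# before llama_cpp loads the shared library and before any generation.
os.environ["GGML_CUDA_DISABLE_GRAPHS"] = "1"

from llama_cpp import Llama  # import only after the variable is set

When launching through the web UI instead of Python directly, exporting the same variable in the shell before running ./start_linux.sh should have the same effect, if the variable is supported by this build.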
Is there an existing issue for this?
[X] I have searched the existing issues
Reproduction
Clone the repo;
Execute ./start_linux.sh --listen --api;
Load a model converted and quantized with llama.cpp (in my particular case Luminum-v0.1-123B-Q4_K_M.gguf, but in my experience the situation is identical with any other model) with the arguments: n-gpu-layers = 38, n_ctx = 24576, tensor_split = 50,50, flash_attn, tensorcores;
Use SillyTavern to connect to the Chat Completion API using the Custom (OpenAI-compatible) chat completion source;
Attempt to generate a response; a direct llama-cpp-python repro sketch follows these steps.
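The following is a minimal repro sketch against llama-cpp-python directly, bypassing the web UI and SillyTavern. It assumes the same CUDA-enabled llama-cpp-python wheel that the web UI installs; parameter names follow the llama_cpp.Llama constructor, and the values mirror the web UI settings above. The web UI's "tensorcores" option selects a different wheel build and has no constructor equivalent here.

from llama_cpp import Llama

# Hedged sketch: same model and settings as in the web UI reproduction.
# tensor_split values are relative per-GPU proportions (here an even split).
llm = Llama(
    model_path="models/Luminum-v0.1-123B-Q4_K_M.gguf",
    n_gpu_layers=38,
    n_ctx=24576,
    tensor_split=[50, 50],
    flash_attn=True,
)

# The crash occurs during generation, after prompt evaluation completes.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])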
Screenshot
Logs
18:10:43-604590 INFO Starting Text generation web UI
18:10:43-606326 WARNING
You are potentially exposing the web UI to the entire internet without any access password.
You can create one with the "--gradio-auth" flag like this:
--gradio-auth username:password
Make sure to replace username:password with your own.
18:10:43-607234 INFO Loading the extension "openai"
18:10:43-651404 INFO OpenAI-compatible API URL:
http://0.0.0.0:5000
Running on local URL: http://0.0.0.0:7860
18:15:52-130068 INFO Loading "Luminum-v0.1-123B-Q4_K_M.gguf"
18:15:52-151573 INFO llama.cpp weights detected: "models/Luminum-v0.1-123B-Q4_K_M.gguf"
llama_model_loader: loaded meta data with 33 key-value pairs and 795 tensors from models/Luminum-v0.1-123B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Luminum v0.1 123B
llama_model_loader: - kv 3: general.basename str = Luminum-v0.1
llama_model_loader: - kv 4: general.size_label str = 123B
llama_model_loader: - kv 5: general.base_model.count u32 = 0
llama_model_loader: - kv 6: general.tags arr[str,2] = ["mergekit", "merge"]
llama_model_loader: - kv 7: llama.block_count u32 = 88
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 12288
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 96
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 32768
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if messages[0]["role"] == "system...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - type f32: 177 tensors
llama_model_loader: - type q4_K: 529 tensors
llama_model_loader: - type q6_K: 89 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_layer = 88
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 122.61 B
llm_load_print_meta: model size = 68.19 GiB (4.78 BPW)
llm_load_print_meta: general.name = Luminum v0.1 123B
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.12 MiB
llm_load_tensors: offloading 38 repeating layers to GPU
llm_load_tensors: offloaded 38/89 layers to GPU
llm_load_tensors: CPU buffer size = 69826.92 MiB
llm_load_tensors: CUDA0 buffer size = 14737.31 MiB
llm_load_tensors: CUDA1 buffer size = 15185.91 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 24576
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4800.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1824.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1824.00 MiB
llama_new_context_with_model: KV self size = 8448.00 MiB, K (f16): 4224.00 MiB, V (f16): 4224.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 428.63 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 184.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 72.01 MiB
llama_new_context_with_model: graph nodes = 2471
llama_new_context_with_model: graph splits = 555
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.chat_template': '{%- if messages[0]["role"] == "system" %}\n {%- set system_message = messages[0]["content"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}\n {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message["role"] == "user" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- "[AVAILABLE_TOOLS] [" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- \'{"type": "function", "function": {\' }}\n {%- for key, val in tool.items() if key != "return" %}\n {%- if val is string %}\n {{- \'"\' + key + \'": "\' + val + \'"\' }}\n {%- else %}\n {{- \'"\' + key + \'": \' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- ", " }}\n {%- endif %}\n {%- endfor %}\n {{- "}}" }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- else %}\n {{- "]" }}\n {%- endif %}\n {%- endfor %}\n {{- "[/AVAILABLE_TOOLS]" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- "[INST] " + system_message + "\\n\\n" + message["content"] + "[/INST]" }}\n {%- else %}\n {{- "[INST] " + message["content"] + "[/INST]" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- "[TOOL_CALLS] [" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n {%- endif %}\n {{- \', "id": "\' + tool_call.id + \'"}\' }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- else %}\n {{- "]" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message["role"] == "assistant" %}\n {{- " " + message["content"]|trim + eos_token}}\n {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- \'[TOOL_RESULTS] {"content": \' + content|string + ", " }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n {%- endif %}\n {{- \'"call_id": "\' + message.tool_call_id + \'"}[/TOOL_RESULTS]\' }}\n {%- else %}\n {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}\n {%- endif %}\n{%- endfor %}\n', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'tokenizer.ggml.add_eos_token': 
'false', 'tokenizer.ggml.add_space_prefix': 'false', 'llama.rope.dimension_count': '128', 'llama.vocab_size': '32768', 'general.file_type': '15', 'llama.attention.value_length': '128', 'llama.attention.key_length': '128', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'general.basename': 'Luminum-v0.1', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '96', 'tokenizer.ggml.pre': 'default', 'llama.context_length': '131072', 'general.name': 'Luminum v0.1 123B', 'general.type': 'model', 'general.size_label': '123B', 'general.base_model.count': '0', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '12288', 'llama.feed_forward_length': '28672', 'llama.block_count': '88', 'llama.attention.head_count_kv': '8'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {%- if messages[0]["role"] == "system" %}
{%- set system_message = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = none %}
{%- endif %}
{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}
{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
{%- set ns = namespace() %}
{%- set ns.index = 0 %}
{%- for message in loop_messages %}
{%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
{%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
{{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif %}
{%- set ns.index = ns.index + 1 %}
{%- endif %}
{%- endfor %}
{{- bos_token }}
{%- for message in loop_messages %}
{%- if message["role"] == "user" %}
{%- if tools is not none and (message == user_messages[-1]) %}
{{- "[AVAILABLE_TOOLS] [" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- '{"type": "function", "function": {' }}
{%- for key, val in tool.items() if key != "return" %}
{%- if val is string %}
{{- '"' + key + '": "' + val + '"' }}
{%- else %}
{{- '"' + key + '": ' + val|tojson }}
{%- endif %}
{%- if not loop.last %}
{{- ", " }}
{%- endif %}
{%- endfor %}
{{- "}}" }}
{%- if not loop.last %}
{{- ", " }}
{%- else %}
{{- "]" }}
{%- endif %}
{%- endfor %}
{{- "[/AVAILABLE_TOOLS]" }}
{%- endif %}
{%- if loop.last and system_message is defined %}
{{- "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]" }}
{%- else %}
{{- "[INST] " + message["content"] + "[/INST]" }}
{%- endif %}
{%- elif message.tool_calls is defined and message.tool_calls is not none %}
{{- "[TOOL_CALLS] [" }}
{%- for tool_call in message.tool_calls %}
{%- set out = tool_call.function|tojson %}
{{- out[:-1] }}
{%- if not tool_call.id is defined or tool_call.id|length != 9 %}
{{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
{%- endif %}
{{- ', "id": "' + tool_call.id + '"}' }}
{%- if not loop.last %}
{{- ", " }}
{%- else %}
{{- "]" + eos_token }}
{%- endif %}
{%- endfor %}
{%- elif message["role"] == "assistant" %}
{{- " " + message["content"]|trim + eos_token}}
{%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
{%- if message.content is defined and message.content.content is defined %}
{%- set content = message.content.content %}
{%- else %}
{%- set content = message.content %}
{%- endif %}
{{- '[TOOL_RESULTS] {"content": ' + content|string + ", " }}
{%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
{{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
{%- endif %}
{{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
{%- else %}
{{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
{%- endif %}
{%- endfor %}
Using chat eos_token: </s>
Using chat bos_token: <s>
18:16:21-633387 INFO Loaded "Luminum-v0.1-123B-Q4_K_M.gguf" in 29.50 seconds.
18:16:21-634296 INFO LOADER: "llama.cpp"
18:16:21-634730 INFO TRUNCATION LENGTH: 24576
18:16:21-635139 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute at /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2694
cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
/home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
[New LWP 2769907]
[New LWP 2769971]
[New LWP 2769973]
[New LWP 2769974]
[New LWP 2769975]
[New LWP 2769976]
[New LWP 2769977]
[New LWP 2769978]
[New LWP 2769979]
[New LWP 2769980]
[New LWP 2769981]
[New LWP 2769982]
[New LWP 2769983]
[New LWP 2769984]
[New LWP 2769985]
[New LWP 2769986]
[New LWP 2769987]
[New LWP 2769988]
[New LWP 2769989]
[New LWP 2769990]
[New LWP 2769991]
[New LWP 2769997]
[New LWP 2770001]
[New LWP 2770002]
[New LWP 2770003]
[New LWP 2770004]
[New LWP 2770005]
[New LWP 2770006]
[New LWP 2770007]
[New LWP 2770008]
[New LWP 2770009]
[New LWP 2770010]
[New LWP 2770011]
[New LWP 2770012]
[New LWP 2770013]
[New LWP 2770014]
[New LWP 2770015]
[New LWP 2770016]
[New LWP 2770017]
[New LWP 2770018]
[New LWP 2770019]
[New LWP 2770020]
[New LWP 2770021]
[New LWP 2770022]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fae89751485 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007fae89751485 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000000000644f15 in pysleep (timeout=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/timemodule.c:2159
2159 /usr/local/src/conda/python-3.11.9/Modules/timemodule.c: No such file or directory.
#2 time_sleep (self=<optimized out>, timeout_obj=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/timemodule.c:383
383 in /usr/local/src/conda/python-3.11.9/Modules/timemodule.c
#3 0x0000000000511b16 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x8a7a38 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7fae89963020, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5020
5020 /usr/local/src/conda/python-3.11.9/Python/ceval.c: No such file or directory.
#4 0x00000000005cbeda in _PyEval_EvalFrame (throwflag=0, frame=0x7fae89963020, tstate=0x8a7a38 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
73 /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h: No such file or directory.
#5 _PyEval_Vector (tstate=tstate@entry=0x8a7a38 <_PyRuntime+166328>, func=func@entry=0x7fae896187c0, locals=locals@entry=0x7fae89672240, args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
6434 /usr/local/src/conda/python-3.11.9/Python/ceval.c: No such file or directory.
#6 0x00000000005cb5af in PyEval_EvalCode (co=co@entry=0x14efcc0, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:1148
1148 in /usr/local/src/conda/python-3.11.9/Python/ceval.c
#7 0x00000000005ec6a7 in run_eval_code_obj (tstate=tstate@entry=0x8a7a38 <_PyRuntime+166328>, co=co@entry=0x14efcc0, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1741
1741 /usr/local/src/conda/python-3.11.9/Python/pythonrun.c: No such file or directory.
#8 0x00000000005e8240 in run_mod (mod=mod@entry=0x1511490, filename=filename@entry=0x7fae895ad5a0, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240, flags=flags@entry=0x7ffcbef00718, arena=arena@entry=0x7fae8959b670) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1762
1762 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#9 0x00000000005fd192 in pyrun_file (fp=fp@entry=0x143f450, filename=filename@entry=0x7fae895ad5a0, start=start@entry=257, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240, closeit=closeit@entry=1, flags=0x7ffcbef00718) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1657
1657 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#10 0x00000000005fc55f in _PyRun_SimpleFileObject (fp=0x143f450, filename=0x7fae895ad5a0, closeit=1, flags=0x7ffcbef00718) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:440
440 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#11 0x00000000005fc283 in _PyRun_AnyFileObject (fp=0x143f450, filename=filename@entry=0x7fae895ad5a0, closeit=closeit@entry=1, flags=flags@entry=0x7ffcbef00718) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:79
79 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#12 0x00000000005f6efe in pymain_run_file_obj (skip_source_first_line=0, filename=0x7fae895ad5a0, program_name=0x7fae896732f0) at /usr/local/src/conda/python-3.11.9/Modules/main.c:360
360 /usr/local/src/conda/python-3.11.9/Modules/main.c: No such file or directory.
#13 pymain_run_file (config=0x88da80 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.9/Modules/main.c:379
379 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#14 pymain_run_python (exitcode=0x7ffcbef00710) at /usr/local/src/conda/python-3.11.9/Modules/main.c:601
601 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#15 Py_RunMain () at /usr/local/src/conda/python-3.11.9/Modules/main.c:680
680 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#16 0x00000000005bbc79 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/main.c:734
734 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#17 0x00007fae896a924a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x00007fae896a9305 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x00000000005bbac3 in _start ()
[Inferior 1 (process 2769871) detached]
Aborted (core dumped)
System Info