Describe the bug
Inference fails after prompt evaluation with the llama-cpp backend, with the following error:
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute at /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2694
cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
/home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
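For context, the failing call is inside ggml's CUDA graph execution path (ggml_backend_cuda_graph_compute). Below is a hedged diagnostic sketch, assuming this wheel honors the GGML_CUDA_DISABLE_GRAPHS environment variable that recent llama.cpp builds check before capturing CUDA graphs; whether this exact build supports it is not confirmed.

import os

# Hedged assumption: recent ggml-cuda builds skip CUDA graph capture when this
# variable is set. It is read when the backend runs graph compute, so set it
# before llama_cpp loads the shared library and before any generation.
os.environ["GGML_CUDA_DISABLE_GRAPHS"] = "1"

from llama_cpp import Llama  # import only after the variable is set

When launching through the web UI instead of Python directly, exporting the same variable in the shell before running ./start_linux.sh should have the same effect, if the variable is supported by this build.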
Is there an existing issue for this?
[X] I have searched the existing issues
Reproduction
Clone the repo;
Execute ./start_linux.sh --listen --api;
Load a model converted and quantized with llama.cpp (in my particular case Luminum-v0.1-123B-Q4_K_M.gguf, but in my experience the situation is identical with any other model) with the arguments: n-gpu-layers = 38, n_ctx = 24576, tensor_split = 50,50, flash_attn, tensorcores;
Use SillyTavern to connect to the Chat Completion API using the Custom (OpenAI-compatible) chat completion source;
Attempt to generate a response; a direct llama-cpp-python repro sketch follows these steps.
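The following is a minimal repro sketch against llama-cpp-python directly, bypassing the web UI and SillyTavern. It assumes the same CUDA-enabled llama-cpp-python wheel that the web UI installs; parameter names follow the llama_cpp.Llama constructor, and the values mirror the web UI settings above. The web UI's "tensorcores" option selects a different wheel build and has no constructor equivalent here.

from llama_cpp import Llama

# Hedged sketch: same model and settings as in the web UI reproduction.
# tensor_split values are relative per-GPU proportions (here an even split).
llm = Llama(
    model_path="models/Luminum-v0.1-123B-Q4_K_M.gguf",
    n_gpu_layers=38,
    n_ctx=24576,
    tensor_split=[50, 50],
    flash_attn=True,
)

# The crash occurs during generation, after prompt evaluation completes.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])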
Screenshot
Logs
18:10:43-604590 INFO Starting Text generation web UI
18:10:43-606326 WARNING
You are potentially exposing the web UI to the entire internet without any access password.
You can create one with the "--gradio-auth" flag like this:
--gradio-auth username:password
Make sure to replace username:password with your own.
18:10:43-607234 INFO Loading the extension "openai"
18:10:43-651404 INFO OpenAI-compatible API URL:
http://0.0.0.0:5000
Running on local URL: http://0.0.0.0:7860
18:15:52-130068 INFO Loading "Luminum-v0.1-123B-Q4_K_M.gguf"
18:15:52-151573 INFO llama.cpp weights detected: "models/Luminum-v0.1-123B-Q4_K_M.gguf"
llama_model_loader: loaded meta data with 33 key-value pairs and 795 tensors from models/Luminum-v0.1-123B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Luminum v0.1 123B
llama_model_loader: - kv 3: general.basename str = Luminum-v0.1
llama_model_loader: - kv 4: general.size_label str = 123B
llama_model_loader: - kv 5: general.base_model.count u32 = 0
llama_model_loader: - kv 6: general.tags arr[str,2] = ["mergekit", "merge"]
llama_model_loader: - kv 7: llama.block_count u32 = 88
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 12288
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 96
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 32768
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if messages[0]["role"] == "system...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - type f32: 177 tensors
llama_model_loader: - type q4_K: 529 tensors
llama_model_loader: - type q6_K: 89 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_layer = 88
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 122.61 B
llm_load_print_meta: model size = 68.19 GiB (4.78 BPW)
llm_load_print_meta: general.name = Luminum v0.1 123B
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.12 MiB
llm_load_tensors: offloading 38 repeating layers to GPU
llm_load_tensors: offloaded 38/89 layers to GPU
llm_load_tensors: CPU buffer size = 69826.92 MiB
llm_load_tensors: CUDA0 buffer size = 14737.31 MiB
llm_load_tensors: CUDA1 buffer size = 15185.91 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 24576
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4800.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1824.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1824.00 MiB
llama_new_context_with_model: KV self size = 8448.00 MiB, K (f16): 4224.00 MiB, V (f16): 4224.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 428.63 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 184.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 72.01 MiB
llama_new_context_with_model: graph nodes = 2471
llama_new_context_with_model: graph splits = 555
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.chat_template': '{%- if messages[0]["role"] == "system" %}\n {%- set system_message = messages[0]["content"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}\n {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message["role"] == "user" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- "[AVAILABLE_TOOLS] [" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- \'{"type": "function", "function": {\' }}\n {%- for key, val in tool.items() if key != "return" %}\n {%- if val is string %}\n {{- \'"\' + key + \'": "\' + val + \'"\' }}\n {%- else %}\n {{- \'"\' + key + \'": \' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- ", " }}\n {%- endif %}\n {%- endfor %}\n {{- "}}" }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- else %}\n {{- "]" }}\n {%- endif %}\n {%- endfor %}\n {{- "[/AVAILABLE_TOOLS]" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- "[INST] " + system_message + "\\n\\n" + message["content"] + "[/INST]" }}\n {%- else %}\n {{- "[INST] " + message["content"] + "[/INST]" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- "[TOOL_CALLS] [" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n {%- endif %}\n {{- \', "id": "\' + tool_call.id + \'"}\' }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- else %}\n {{- "]" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message["role"] == "assistant" %}\n {{- " " + message["content"]|trim + eos_token}}\n {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- \'[TOOL_RESULTS] {"content": \' + content|string + ", " }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n {%- endif %}\n {{- \'"call_id": "\' + message.tool_call_id + \'"}[/TOOL_RESULTS]\' }}\n {%- else %}\n {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}\n {%- endif %}\n{%- endfor %}\n', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'tokenizer.ggml.add_eos_token': 
'false', 'tokenizer.ggml.add_space_prefix': 'false', 'llama.rope.dimension_count': '128', 'llama.vocab_size': '32768', 'general.file_type': '15', 'llama.attention.value_length': '128', 'llama.attention.key_length': '128', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'general.basename': 'Luminum-v0.1', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '96', 'tokenizer.ggml.pre': 'default', 'llama.context_length': '131072', 'general.name': 'Luminum v0.1 123B', 'general.type': 'model', 'general.size_label': '123B', 'general.base_model.count': '0', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '12288', 'llama.feed_forward_length': '28672', 'llama.block_count': '88', 'llama.attention.head_count_kv': '8'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {%- if messages[0]["role"] == "system" %}
{%- set system_message = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = none %}
{%- endif %}
{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}
{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
{%- set ns = namespace() %}
{%- set ns.index = 0 %}
{%- for message in loop_messages %}
{%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
{%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
{{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif %}
{%- set ns.index = ns.index + 1 %}
{%- endif %}
{%- endfor %}
{{- bos_token }}
{%- for message in loop_messages %}
{%- if message["role"] == "user" %}
{%- if tools is not none and (message == user_messages[-1]) %}
{{- "[AVAILABLE_TOOLS] [" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- '{"type": "function", "function": {' }}
{%- for key, val in tool.items() if key != "return" %}
{%- if val is string %}
{{- '"' + key + '": "' + val + '"' }}
{%- else %}
{{- '"' + key + '": ' + val|tojson }}
{%- endif %}
{%- if not loop.last %}
{{- ", " }}
{%- endif %}
{%- endfor %}
{{- "}}" }}
{%- if not loop.last %}
{{- ", " }}
{%- else %}
{{- "]" }}
{%- endif %}
{%- endfor %}
{{- "[/AVAILABLE_TOOLS]" }}
{%- endif %}
{%- if loop.last and system_message is defined %}
{{- "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]" }}
{%- else %}
{{- "[INST] " + message["content"] + "[/INST]" }}
{%- endif %}
{%- elif message.tool_calls is defined and message.tool_calls is not none %}
{{- "[TOOL_CALLS] [" }}
{%- for tool_call in message.tool_calls %}
{%- set out = tool_call.function|tojson %}
{{- out[:-1] }}
{%- if not tool_call.id is defined or tool_call.id|length != 9 %}
{{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
{%- endif %}
{{- ', "id": "' + tool_call.id + '"}' }}
{%- if not loop.last %}
{{- ", " }}
{%- else %}
{{- "]" + eos_token }}
{%- endif %}
{%- endfor %}
{%- elif message["role"] == "assistant" %}
{{- " " + message["content"]|trim + eos_token}}
{%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
{%- if message.content is defined and message.content.content is defined %}
{%- set content = message.content.content %}
{%- else %}
{%- set content = message.content %}
{%- endif %}
{{- '[TOOL_RESULTS] {"content": ' + content|string + ", " }}
{%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
{{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
{%- endif %}
{{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
{%- else %}
{{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
{%- endif %}
{%- endfor %}
Using chat eos_token: </s>
Using chat bos_token: <s>
18:16:21-633387 INFO Loaded "Luminum-v0.1-123B-Q4_K_M.gguf" in 29.50 seconds.
18:16:21-634296 INFO LOADER: "llama.cpp"
18:16:21-634730 INFO TRUNCATION LENGTH: 24576
18:16:21-635139 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute at /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2694
cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
/home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
[New LWP 2769907]
[New LWP 2769971]
[New LWP 2769973]
[New LWP 2769974]
[New LWP 2769975]
[New LWP 2769976]
[New LWP 2769977]
[New LWP 2769978]
[New LWP 2769979]
[New LWP 2769980]
[New LWP 2769981]
[New LWP 2769982]
[New LWP 2769983]
[New LWP 2769984]
[New LWP 2769985]
[New LWP 2769986]
[New LWP 2769987]
[New LWP 2769988]
[New LWP 2769989]
[New LWP 2769990]
[New LWP 2769991]
[New LWP 2769997]
[New LWP 2770001]
[New LWP 2770002]
[New LWP 2770003]
[New LWP 2770004]
[New LWP 2770005]
[New LWP 2770006]
[New LWP 2770007]
[New LWP 2770008]
[New LWP 2770009]
[New LWP 2770010]
[New LWP 2770011]
[New LWP 2770012]
[New LWP 2770013]
[New LWP 2770014]
[New LWP 2770015]
[New LWP 2770016]
[New LWP 2770017]
[New LWP 2770018]
[New LWP 2770019]
[New LWP 2770020]
[New LWP 2770021]
[New LWP 2770022]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fae89751485 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007fae89751485 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000000000644f15 in pysleep (timeout=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/timemodule.c:2159
2159 /usr/local/src/conda/python-3.11.9/Modules/timemodule.c: No such file or directory.
#2 time_sleep (self=<optimized out>, timeout_obj=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/timemodule.c:383
383 in /usr/local/src/conda/python-3.11.9/Modules/timemodule.c
#3 0x0000000000511b16 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x8a7a38 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7fae89963020, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5020
5020 /usr/local/src/conda/python-3.11.9/Python/ceval.c: No such file or directory.
#4 0x00000000005cbeda in _PyEval_EvalFrame (throwflag=0, frame=0x7fae89963020, tstate=0x8a7a38 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
73 /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h: No such file or directory.
#5 _PyEval_Vector (tstate=tstate@entry=0x8a7a38 <_PyRuntime+166328>, func=func@entry=0x7fae896187c0, locals=locals@entry=0x7fae89672240, args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
6434 /usr/local/src/conda/python-3.11.9/Python/ceval.c: No such file or directory.
#6 0x00000000005cb5af in PyEval_EvalCode (co=co@entry=0x14efcc0, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:1148
1148 in /usr/local/src/conda/python-3.11.9/Python/ceval.c
#7 0x00000000005ec6a7 in run_eval_code_obj (tstate=tstate@entry=0x8a7a38 <_PyRuntime+166328>, co=co@entry=0x14efcc0, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1741
1741 /usr/local/src/conda/python-3.11.9/Python/pythonrun.c: No such file or directory.
#8 0x00000000005e8240 in run_mod (mod=mod@entry=0x1511490, filename=filename@entry=0x7fae895ad5a0, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240, flags=flags@entry=0x7ffcbef00718, arena=arena@entry=0x7fae8959b670) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1762
1762 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#9 0x00000000005fd192 in pyrun_file (fp=fp@entry=0x143f450, filename=filename@entry=0x7fae895ad5a0, start=start@entry=257, globals=globals@entry=0x7fae89672240, locals=locals@entry=0x7fae89672240, closeit=closeit@entry=1, flags=0x7ffcbef00718) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1657
1657 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#10 0x00000000005fc55f in _PyRun_SimpleFileObject (fp=0x143f450, filename=0x7fae895ad5a0, closeit=1, flags=0x7ffcbef00718) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:440
440 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#11 0x00000000005fc283 in _PyRun_AnyFileObject (fp=0x143f450, filename=filename@entry=0x7fae895ad5a0, closeit=closeit@entry=1, flags=flags@entry=0x7ffcbef00718) at /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:79
79 in /usr/local/src/conda/python-3.11.9/Python/pythonrun.c
#12 0x00000000005f6efe in pymain_run_file_obj (skip_source_first_line=0, filename=0x7fae895ad5a0, program_name=0x7fae896732f0) at /usr/local/src/conda/python-3.11.9/Modules/main.c:360
360 /usr/local/src/conda/python-3.11.9/Modules/main.c: No such file or directory.
#13 pymain_run_file (config=0x88da80 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.9/Modules/main.c:379
379 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#14 pymain_run_python (exitcode=0x7ffcbef00710) at /usr/local/src/conda/python-3.11.9/Modules/main.c:601
601 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#15 Py_RunMain () at /usr/local/src/conda/python-3.11.9/Modules/main.c:680
680 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#16 0x00000000005bbc79 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/main.c:734
734 in /usr/local/src/conda/python-3.11.9/Modules/main.c
#17 0x00007fae896a924a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x00007fae896a9305 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x00000000005bbac3 in _start ()
[Inferior 1 (process 2769871) detached]
Aborted (core dumped)
System Info