ggozad / oterm

a text-based terminal client for Ollama
MIT License

Regenerate last Ollama message does not work as expected #116

Closed · mcDandy closed this 1 month ago

mcDandy commented 1 month ago

It ignores the last three messages, resulting in a reply to a previous message I sent (I send A, get B. I send C, get D. I send E, get F. I click regenerate and get a variation of D).
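
For comparison, here is a minimal sketch of what regenerating the last message is expected to do against the public Ollama /api/chat endpoint: drop only the final assistant message and resend the remaining history. This is illustrative Python, not oterm's actual code; the model tag gemma2:27b and the toy message contents are assumptions.

```python
import requests

# Toy history mirroring the report: user A -> assistant B, C -> D, E -> F.
history = [
    {"role": "user", "content": "A"},
    {"role": "assistant", "content": "B"},
    {"role": "user", "content": "C"},
    {"role": "assistant", "content": "D"},
    {"role": "user", "content": "E"},
    {"role": "assistant", "content": "F"},
]

# Expected regenerate behaviour: drop only the trailing assistant message ("F")
# and resend everything else, so the model answers "E" again.
resend = history[:-1]

resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={"model": "gemma2:27b", "messages": resend, "stream": False},
)
print(resp.json()["message"]["content"])  # should be a new variation of F, not of D
```

If the reply comes back as a variation of D instead, the request that was actually sent must have ended at message C.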

chat-2-gemma227b-regenerate_broken.md

The regeneration is not even saved in the conversation or database:

Stella's fingers flew across the keyboard, desperately trying to make sense of the garbled data stream. One moment there was a clear biological signature, faint but distinct, pulsing like a heartbeat from deep within the lunar surface. The next, silence. Nothing.

"Impossible," she muttered, her brow furrowed in concentration. "It can't just vanish."

She ran diagnostics on the probe, checked for interference, anything that could explain the sudden disappearance. But everything was functioning perfectly. The probe was still transmitting, its camera lens clear, but the source of the signal - whatever it was - had simply ceased to exist.

A chill crept down Stella's spine. This wasn't like a creature retreating into the shadows. It was as if it had never been there at all. A ghost in the machine, a phantom echo of life that left no trace.

The scientific part of her mind demanded logic, an explanation. But something deeper, primal and unsettling, whispered doubts. Had she glimpsed something beyond human comprehension? Something ancient and powerful, capable of existing outside the boundaries of time and space?

Stella stared at the blank screen, the cursor blinking mockingly. She had stumbled upon a mystery far greater than anything she could have imagined. And now it was gone, leaving her with more questions than answers.

The boredom was replaced by an unsettling unease. The sterile silence of the research station seemed to amplify the whispers in her mind. What had she awakened? And what would happen next?

mcDandy commented 1 month ago

Forgot the Ollama log, in case it's needed:

2024/09/12 23:54:07 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:F:\\Users\\danda\\OneDrive - Univerzita Pardubice\\Dokumenty\\LLMs OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\danda\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-09-12T23:54:07.886+02:00 level=INFO source=images.go:753 msg="total blobs: 16"
time=2024-09-12T23:54:07.898+02:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-12T23:54:07.899+02:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.10)"
time=2024-09-12T23:54:07.900+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]"
time=2024-09-12T23:54:07.900+02:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-09-12T23:54:08.029+02:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-b24d85ba-d49c-e2a7-5451-a3f9f4a56b58 library=cuda variant=v12 compute=8.9 driver=12.5 name="NVIDIA GeForce RTX 4080 Laptop GPU" total="12.0 GiB" available="10.8 GiB"
[GIN] 2024/09/12 - 23:54:31 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2024-09-12T23:54:42.906+02:00 level=INFO source=server.go:101 msg="system memory" total="31.7 GiB" free="20.5 GiB" free_swap="31.5 GiB"
time=2024-09-12T23:54:42.906+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=28 layers.split="" memory.available="[11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.4 GiB" memory.required.partial="10.9 GiB" memory.required.kv="736.0 MiB" memory.required.allocations="[10.9 GiB]" memory.weights.total="14.4 GiB" memory.weights.repeating="13.5 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="509.0 MiB" memory.graph.partial="1.4 GiB"
time=2024-09-12T23:54:42.915+02:00 level=INFO source=server.go:391 msg="starting llama server" cmd="C:\\Users\\danda\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model F:\\Users\\danda\\OneDrive - Univerzita Pardubice\\Dokumenty\\LLMs\\blobs\\sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 28 --no-mmap --parallel 1 --port 50739"
time=2024-09-12T23:54:42.985+02:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-09-12T23:54:42.985+02:00 level=INFO source=server.go:590 msg="waiting for llama runner to start responding"
time=2024-09-12T23:54:42.990+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3661 commit="8962422b" tid="23264" timestamp=1726178083
INFO [wmain] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="23264" timestamp=1726178083 total_threads=32
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="50739" tid="23264" timestamp=1726178083
llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from F:\Users\danda\OneDrive - Univerzita Pardubice\Dokumenty\LLMs\blobs\sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-27b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 4608
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 46
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 36864
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 32
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 128
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 128
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  185 tensors
llama_model_loader: - type q4_0:  322 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-09-12T23:54:43.513+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 108
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4608
llm_load_print_meta: n_layer          = 46
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 36864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 27B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 27.23 B
llm_load_print_meta: model size       = 14.55 GiB (4.59 BPW)
llm_load_print_meta: general.name     = gemma-2-27b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.45 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/47 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  7314.49 MiB
llm_load_tensors:      CUDA0 buffer size =  8506.97 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   288.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   448.00 MiB
llama_new_context_with_model: KV self size  =  736.00 MiB, K (f16):  368.00 MiB, V (f16):  368.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1431.85 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    17.01 MiB
llama_new_context_with_model: graph nodes  = 1850
llama_new_context_with_model: graph splits = 238
INFO [wmain] model loaded | tid="23264" timestamp=1726178106
time=2024-09-12T23:55:06.887+02:00 level=INFO source=server.go:629 msg="llama runner started in 23.90 seconds"
[GIN] 2024/09/12 - 23:56:37 | 200 |         1m54s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/09/12 - 23:59:34 | 200 |         1m34s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/09/13 - 00:01:25 | 200 |         1m37s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/09/13 - 00:02:47 | 200 |         1m21s |       127.0.0.1 | POST     "/api/chat"
ggozad commented 1 month ago

I am having trouble reproducing this; it seems to be working fine for me. Is it possible that, for instance, you interrupt inference and regenerate before the LLM has completed its response?
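
To make that hypothesis concrete (a purely illustrative sketch; the regenerate() helper and the data are invented for this example and are not oterm's code): if the interrupted E/F exchange never made it into the saved history, a regenerate would strip D and resend only A, B, C, which would indeed yield a variation of D.

```python
# Hypothetical illustration of the suggested race; nothing here is oterm code.
saved_history = [
    {"role": "user", "content": "A"},
    {"role": "assistant", "content": "B"},
    {"role": "user", "content": "C"},
    {"role": "assistant", "content": "D"},
    # The interrupted E -> F exchange was never persisted.
]

def regenerate(messages):
    """Drop a trailing assistant message, then resend the rest."""
    if messages and messages[-1]["role"] == "assistant":
        messages = messages[:-1]
    return messages

# With E/F missing, only A, B, C are resent, so the model re-answers C
# and produces a variation of D, matching the reported behaviour.
print(regenerate(saved_history))
```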

mcDandy commented 1 month ago

Not sure what happened yesterday... probably the aftermath of NVIDIA GPU driver problems (Ollama has problems with Game Ready 361.09). I also cannot replicate it.

> Is it possible that, for instance, you interrupt inference and regenerate before the LLM has completed its response?

Probably? Sometimes I had a problem where the GPU ran at 25% but no tokens were output for minutes (while I normally get 3 t/s)...

ggozad commented 1 month ago

Try to see if you can reproduce it; otherwise, please just close the ticket.