ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

ollama can't run qwen:72b, error msg "gpu VRAM usage didn't recover within timeout" #4427

Closed changingshow closed 6 months ago

changingshow commented 6 months ago

What is the issue?

I have already downloaded qwen:7b, but when I run ollama run qwen:7b I get this error: Error: timed out waiting for llama runner to start. The server.log contains this message: gpu VRAM usage didn't recover within timeout

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.1.37

chrisoutwright commented 6 months ago

I also get it (with smaller params) when running an RTX 2080 Ti or a GTX 1060 with codeqwen:chat and codegemma:instruct on Win10. The model stays in system RAM but has to be copied to GPU VRAM again after every chat POST.

The GPU VRAM is not exceeded on them, so I'm not sure why it times out every time.

dhiltgen commented 6 months ago

This was fixed in 0.1.38 via PR #4430. If you're still seeing problems after upgrading, please share your server log and I'll reopen.
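For anyone hitting this later: upgrading in place is enough to pick up the fix. A minimal sketch for Linux (Windows users would re-run the installer or let the app auto-update):

# the official install script also upgrades an existing install
curl -fsSL https://ollama.com/install.sh | sh
ollama --version   # should now report 0.1.38 or newer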

pamanseau commented 6 months ago

@dhiltgen I have the same issue with 0.1.38 on Linux: ollama.log

dhiltgen commented 6 months ago

@pamanseau from the logs you shared, it looks like the client gave up before the model finished loading, and since the client request was canceled, we canceled the loading of the model. Are you using our CLI, or are you calling the API? If you're calling the API, what timeout are you setting in your client?

kirkster96 commented 6 months ago

@dhiltgen

I also have the same issue using ollama/ollama:latest on Docker. I notice that this occurs after I successfully run a prompt and then let it idle. When I come back, I have to wait for the llama runner started in... message again. ollama.log

Edit:

Turns out that I did not specify a keep-alive and the default is 5 minutes! Never mind! 😁 @pamanseau @dhiltgen How do I keep a model loaded in memory or make it unload immediately?
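For reference, keep-alive can be controlled per request as well as via OLLAMA_KEEP_ALIVE; a minimal sketch with curl (the model name is just an example):

# keep the model loaded indefinitely after this request
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'
# unload it immediately instead
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": 0}'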

pamanseau commented 6 months ago

> @pamanseau from the logs you shared, it looks like the client gave up before the model finished loading, and since the client request was canceled, we canceled the loading of the model. Are you using our CLI, or are you calling the API? If you're calling the API, what timeout are you setting in your client?

We don't set a specific timeout, whether using the API or the WebUI. Perhaps the default timeout is too small.

axel7083 commented 6 months ago

I am facing the same issue in version v0.1.41 when trying to set PARAMETER num_gpu 33. I got the following:

[GIN] 2024/06/04 - 11:07:42 | 200 |         3m23s |       10.88.0.1 | POST     "/v1/chat/completions"
time=2024-06-04T11:12:49.354Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=6.957608932
time=2024-06-04T11:12:51.595Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=9.198405481
time=2024-06-04T11:12:53.757Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=11.359973211
dhiltgen commented 6 months ago

@axel7083 look earlier in your logs. My suspicion is the GPU runner crashed (perhaps 33 layers was too much for your GPU), we fell back to CPU, and there's a minor bug where we mistakenly try to verify VRAM recovery on unload even though the runner was CPU based.

dhiltgen commented 6 months ago

@pamanseau can you try to repro with the CLI to rule out client timeouts?

abstract-entity commented 6 months ago

Hello, I got the same issue here with an RTX 4090 GPU:

2024-06-05 09:56:37 time=2024-06-05T07:56:37.849Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.10060701 2024-06-05 09:56:38 time=2024-06-05T07:56:38.099Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.350123004 2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" 2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" 2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama2435727696/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 36319" 2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=sched.go:338 msg="loaded runners" count=1 2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding" 2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" 2024-06-05 09:56:38 time=2024-06-05T07:56:38.349Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.600819928 2024-06-05 09:56:38 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) 2024-06-05 09:56:38 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
2024-06-05 09:56:38 llama_model_loader: - kv 0: general.architecture str = llama 2024-06-05 09:56:32 [GIN] 2024/06/05 - 07:56:32 | 200 | 11.239µs | 172.18.0.3 | HEAD "/" 2024-06-05 09:56:38 INFO [main] build info | build=1 commit="5921b8f" tid="139796719271936" timestamp=1717574198 2024-06-05 09:56:38 INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139796719271936" timestamp=1717574198 total_threads=32 2024-06-05 09:56:38 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36319" tid="139796719271936" timestamp=1717574198 2024-06-05 09:56:38 llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct 2024-06-05 09:56:38 llama_model_loader: - kv 2: llama.block_count u32 = 32 2024-06-05 09:56:38 llama_model_loader: - kv 3: llama.context_length u32 = 8192 2024-06-05 09:56:38 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 2024-06-05 09:56:38 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 2024-06-05 09:56:38 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 2024-06-05 09:56:38 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 2024-06-05 09:56:38 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 2024-06-05 09:56:38 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 2024-06-05 09:56:38 llama_model_loader: - kv 10: general.file_type u32 = 2 2024-06-05 09:56:38 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 2024-06-05 09:56:38 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 2024-06-05 09:56:38 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 2024-06-05 09:56:38 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe 2024-06-05 09:56:38 llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-06-05 09:56:38 llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-06-05 09:56:38 llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 2024-06-05 09:56:38 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 2024-06-05 09:56:38 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 2024-06-05 09:56:38 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
2024-06-05 09:56:38 llama_model_loader: - kv 21: general.quantization_version u32 = 2 2024-06-05 09:56:38 llama_model_loader: - type f32: 65 tensors 2024-06-05 09:56:38 llama_model_loader: - type q4_0: 225 tensors 2024-06-05 09:56:38 llama_model_loader: - type q6_K: 1 tensors 2024-06-05 09:56:38 time=2024-06-05T07:56:38.590Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model" 2024-06-05 09:56:38 llm_load_vocab: special tokens cache size = 256 2024-06-05 09:56:38 llm_load_vocab: token to piece cache size = 1.5928 MB 2024-06-05 09:56:38 llm_load_print_meta: format = GGUF V3 (latest) 2024-06-05 09:56:38 llm_load_print_meta: arch = llama 2024-06-05 09:56:38 llm_load_print_meta: vocab type = BPE 2024-06-05 09:56:38 llm_load_print_meta: n_vocab = 128256 2024-06-05 09:56:38 llm_load_print_meta: n_merges = 280147 2024-06-05 09:56:38 llm_load_print_meta: n_ctx_train = 8192 2024-06-05 09:56:38 llm_load_print_meta: n_embd = 4096 2024-06-05 09:56:38 llm_load_print_meta: n_head = 32 2024-06-05 09:56:38 llm_load_print_meta: n_head_kv = 8 2024-06-05 09:56:38 llm_load_print_meta: n_layer = 32 2024-06-05 09:56:38 llm_load_print_meta: n_rot = 128 2024-06-05 09:56:38 llm_load_print_meta: n_embd_head_k = 128 2024-06-05 09:56:38 llm_load_print_meta: n_embd_head_v = 128 2024-06-05 09:56:38 llm_load_print_meta: n_gqa = 4 2024-06-05 09:56:38 llm_load_print_meta: n_embd_k_gqa = 1024 2024-06-05 09:56:38 llm_load_print_meta: n_embd_v_gqa = 1024 2024-06-05 09:56:38 llm_load_print_meta: f_norm_eps = 0.0e+00 2024-06-05 09:56:38 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 2024-06-05 09:56:38 llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-06-05 09:56:38 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-06-05 09:56:38 llm_load_print_meta: f_logit_scale = 0.0e+00 2024-06-05 09:56:38 llm_load_print_meta: n_ff = 14336 2024-06-05 09:56:38 llm_load_print_meta: n_expert = 0 2024-06-05 09:56:38 llm_load_print_meta: n_expert_used = 0 2024-06-05 09:56:38 llm_load_print_meta: causal attn = 1 2024-06-05 09:56:38 llm_load_print_meta: pooling type = 0 2024-06-05 09:56:38 llm_load_print_meta: rope type = 0 2024-06-05 09:56:38 llm_load_print_meta: rope scaling = linear 2024-06-05 09:56:38 llm_load_print_meta: freq_base_train = 500000.0 2024-06-05 09:56:38 llm_load_print_meta: freq_scale_train = 1 2024-06-05 09:56:38 llm_load_print_meta: n_yarn_orig_ctx = 8192 2024-06-05 09:56:38 llm_load_print_meta: rope_finetuned = unknown 2024-06-05 09:56:38 llm_load_print_meta: ssm_d_conv = 0 2024-06-05 09:56:38 llm_load_print_meta: ssm_d_inner = 0 2024-06-05 09:56:38 llm_load_print_meta: ssm_d_state = 0 2024-06-05 09:56:38 llm_load_print_meta: ssm_dt_rank = 0 2024-06-05 09:56:38 llm_load_print_meta: model type = 8B 2024-06-05 09:56:38 llm_load_print_meta: model ftype = Q4_0 2024-06-05 09:56:38 llm_load_print_meta: model params = 8.03 B 2024-06-05 09:56:38 llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) 2024-06-05 09:56:38 llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct 2024-06-05 09:56:38 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' 2024-06-05 09:56:38 llm_load_print_meta: EOS token = 128009 '<|eot_id|>' 2024-06-05 09:56:38 llm_load_print_meta: LF token = 128 'Ä' 2024-06-05 09:56:38 llm_load_print_meta: EOT token = 128009 '<|eot_id|>' 2024-06-05 09:56:38 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes 2024-06-05 09:56:38 ggml_cuda_init: CUDA_USE_TENSOR_CORES: no 2024-06-05 09:56:38 ggml_cuda_init: found 1 CUDA devices: 2024-06-05 09:56:38 Device 0: NVIDIA 
GeForce RTX 4090, compute capability 8.9, VMM: yes 2024-06-05 09:56:38 llm_load_tensors: ggml ctx size = 0.30 MiB 2024-06-05 09:56:39 llm_load_tensors: offloading 32 repeating layers to GPU 2024-06-05 09:56:39 llm_load_tensors: offloading non-repeating layers to GPU 2024-06-05 09:56:39 llm_load_tensors: offloaded 33/33 layers to GPU 2024-06-05 09:56:39 llm_load_tensors: CPU buffer size = 281.81 MiB 2024-06-05 09:56:39 llm_load_tensors: CUDA0 buffer size = 4155.99 MiB 2024-06-05 09:56:39 llama_new_context_with_model: n_ctx = 2048 2024-06-05 09:56:39 llama_new_context_with_model: n_batch = 512 2024-06-05 09:56:39 llama_new_context_with_model: n_ubatch = 512 2024-06-05 09:56:39 llama_new_context_with_model: flash_attn = 0 2024-06-05 09:56:39 llama_new_context_with_model: freq_base = 500000.0 2024-06-05 09:56:39 llama_new_context_with_model: freq_scale = 1 2024-06-05 09:56:39 llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB 2024-06-05 09:56:39 llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB 2024-06-05 09:56:39 llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB 2024-06-05 09:56:39 llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB 2024-06-05 09:56:39 llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB 2024-06-05 09:56:39 llama_new_context_with_model: graph nodes = 1030 2024-06-05 09:56:39 llama_new_context_with_model: graph splits = 2 2024-06-05 09:56:40 time=2024-06-05T07:56:40.095Z level=INFO source=server.go:572 msg="llama runner started in 1.76 seconds" 2024-06-05 09:56:45 time=2024-06-05T07:56:45.546Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.076150155 2024-06-05 09:56:45 time=2024-06-05T07:56:45.796Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3259817720000004 2024-06-05 09:56:46 time=2024-06-05T07:56:46.042Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" 2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" 2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama2435727696/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --mlock --parallel 1 --port 41889" 2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=sched.go:338 msg="loaded runners" count=1 2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding" 2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" 
2024-06-05 09:56:46 time=2024-06-05T07:56:46.045Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.575500551 2024-06-05 09:56:46 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) 2024-06-05 09:56:46 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-06-05 09:56:46 llama_model_loader: - kv 0: general.architecture str = llama 2024-06-05 09:56:46 llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct 2024-06-05 09:56:46 llama_model_loader: - kv 2: llama.block_count u32 = 32 2024-06-05 09:56:46 llama_model_loader: - kv 3: llama.context_length u32 = 8192 2024-06-05 09:56:46 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 2024-06-05 09:56:46 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 2024-06-05 09:56:46 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 2024-06-05 09:56:46 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 2024-06-05 09:56:46 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 2024-06-05 09:56:46 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 2024-06-05 09:56:46 llama_model_loader: - kv 10: general.file_type u32 = 2 2024-06-05 09:56:46 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 2024-06-05 09:56:46 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 2024-06-05 09:56:46 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 2024-06-05 09:56:46 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe 2024-06-05 09:56:46 llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-06-05 09:56:46 llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-06-05 09:56:46 llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 2024-06-05 09:56:46 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 2024-06-05 09:56:46 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 2024-06-05 09:56:46 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
2024-06-05 09:56:46 llama_model_loader: - kv 21: general.quantization_version u32 = 2 2024-06-05 09:56:46 llama_model_loader: - type f32: 65 tensors 2024-06-05 09:56:39 INFO [main] model loaded | tid="139796719271936" timestamp=1717574199 2024-06-05 09:56:40 [GIN] 2024/06/05 - 07:56:40 | 200 | 7.453372923s | 172.18.0.3 | POST "/api/embeddings" 2024-06-05 09:56:46 INFO [main] build info | build=1 commit="5921b8f" tid="140222294765568" timestamp=1717574206 2024-06-05 09:56:46 INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140222294765568" timestamp=1717574206 total_threads=32 2024-06-05 09:56:46 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="41889" tid="140222294765568" timestamp=1717574206 2024-06-05 09:56:47 INFO [main] model loaded | tid="140222294765568" timestamp=1717574207 2024-06-05 09:56:50 [GIN] 2024/06/05 - 07:56:50 | 200 | 10.221225661s | 172.18.0.3 | POST "/api/chat" 2024-06-05 09:56:46 llama_model_loader: - type q4_0: 225 tensors 2024-06-05 09:56:46 llama_model_loader: - type q6_K: 1 tensors 2024-06-05 09:56:46 time=2024-06-05T07:56:46.294Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model" 2024-06-05 09:56:46 llm_load_vocab: special tokens cache size = 256 2024-06-05 09:56:46 llm_load_vocab: token to piece cache size = 1.5928 MB 2024-06-05 09:56:46 llm_load_print_meta: format = GGUF V3 (latest) 2024-06-05 09:56:46 llm_load_print_meta: arch = llama 2024-06-05 09:56:46 llm_load_print_meta: vocab type = BPE 2024-06-05 09:56:46 llm_load_print_meta: n_vocab = 128256 2024-06-05 09:56:46 llm_load_print_meta: n_merges = 280147 2024-06-05 09:56:46 llm_load_print_meta: n_ctx_train = 8192 2024-06-05 09:56:46 llm_load_print_meta: n_embd = 4096 2024-06-05 09:56:46 llm_load_print_meta: n_head = 32 2024-06-05 09:56:46 llm_load_print_meta: n_head_kv = 8 2024-06-05 09:56:46 llm_load_print_meta: n_layer = 32 2024-06-05 09:56:46 llm_load_print_meta: n_rot = 128 2024-06-05 09:56:46 llm_load_print_meta: n_embd_head_k = 128 2024-06-05 09:56:46 llm_load_print_meta: n_embd_head_v = 128 2024-06-05 09:56:46 llm_load_print_meta: n_gqa = 4 2024-06-05 09:56:46 llm_load_print_meta: n_embd_k_gqa = 1024 2024-06-05 09:56:46 llm_load_print_meta: n_embd_v_gqa = 1024 2024-06-05 09:56:46 llm_load_print_meta: f_norm_eps = 0.0e+00 2024-06-05 09:56:46 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 2024-06-05 09:56:46 llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-06-05 09:56:46 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-06-05 09:56:46 llm_load_print_meta: f_logit_scale = 0.0e+00 2024-06-05 09:56:46 llm_load_print_meta: n_ff = 14336 2024-06-05 09:56:46 llm_load_print_meta: n_expert = 0 2024-06-05 09:56:46 llm_load_print_meta: n_expert_used = 0 2024-06-05 09:56:46 llm_load_print_meta: causal attn = 1 2024-06-05 09:56:46 llm_load_print_meta: pooling type = 0 2024-06-05 09:56:46 llm_load_print_meta: rope type = 0 2024-06-05 09:56:46 llm_load_print_meta: rope scaling = linear 2024-06-05 09:56:46 llm_load_print_meta: freq_base_train = 500000.0 2024-06-05 09:56:46 llm_load_print_meta: freq_scale_train = 1 2024-06-05 09:56:46 llm_load_print_meta: n_yarn_orig_ctx = 8192 2024-06-05 09:56:46 llm_load_print_meta: rope_finetuned = 
unknown 2024-06-05 09:56:46 llm_load_print_meta: ssm_d_conv = 0 2024-06-05 09:56:46 llm_load_print_meta: ssm_d_inner = 0 2024-06-05 09:56:46 llm_load_print_meta: ssm_d_state = 0 2024-06-05 09:56:46 llm_load_print_meta: ssm_dt_rank = 0 2024-06-05 09:56:46 llm_load_print_meta: model type = 8B 2024-06-05 09:56:46 llm_load_print_meta: model ftype = Q4_0 2024-06-05 09:56:46 llm_load_print_meta: model params = 8.03 B 2024-06-05 09:56:46 llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) 2024-06-05 09:56:46 llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct 2024-06-05 09:56:46 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' 2024-06-05 09:56:46 llm_load_print_meta: EOS token = 128009 '<|eot_id|>' 2024-06-05 09:56:46 llm_load_print_meta: LF token = 128 'Ä' 2024-06-05 09:56:46 llm_load_print_meta: EOT token = 128009 '<|eot_id|>' 2024-06-05 09:56:46 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes 2024-06-05 09:56:46 ggml_cuda_init: CUDA_USE_TENSOR_CORES: no 2024-06-05 09:56:46 ggml_cuda_init: found 1 CUDA devices: 2024-06-05 09:56:46 Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes 2024-06-05 09:56:46 llm_load_tensors: ggml ctx size = 0.30 MiB 2024-06-05 09:56:46 llm_load_tensors: offloading 32 repeating layers to GPU 2024-06-05 09:56:46 llm_load_tensors: offloading non-repeating layers to GPU 2024-06-05 09:56:46 llm_load_tensors: offloaded 33/33 layers to GPU 2024-06-05 09:56:46 llm_load_tensors: CPU buffer size = 281.81 MiB 2024-06-05 09:56:46 llm_load_tensors: CUDA0 buffer size = 4155.99 MiB 2024-06-05 09:56:47 llama_new_context_with_model: n_ctx = 2048 2024-06-05 09:56:47 llama_new_context_with_model: n_batch = 512 2024-06-05 09:56:47 llama_new_context_with_model: n_ubatch = 512 2024-06-05 09:56:47 llama_new_context_with_model: flash_attn = 0 2024-06-05 09:56:47 llama_new_context_with_model: freq_base = 500000.0 2024-06-05 09:56:47 llama_new_context_with_model: freq_scale = 1 2024-06-05 09:56:47 llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB 2024-06-05 09:56:47 llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB 2024-06-05 09:56:47 llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB 2024-06-05 09:56:47 llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB 2024-06-05 09:56:47 llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB 2024-06-05 09:56:47 llama_new_context_with_model: graph nodes = 1030 2024-06-05 09:56:47 llama_new_context_with_model: graph splits = 2 2024-06-05 09:56:47 time=2024-06-05T07:56:47.798Z level=INFO source=server.go:572 msg="llama runner started in 1.76 seconds"

om35 commented 6 months ago

Hello, same issue on 0.1.41:

/tmp/ollama1286469287/runners/cuda_v11/ollama_llama_server: error while loading shared libraries: /usr/local/cuda/lib64/libcudart.so.11.0: file too short
time=2024-06-05T09:34:25.172Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
time=2024-06-05T09:34:25.422Z level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 127 "
[GIN] 2024/06/05 - 09:34:25 | 500 |  498.164855ms |       127.0.0.1 | POST     "/api/chat"
time=2024-06-05T09:34:30.487Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.06438619
time=2024-06-05T09:34:30.737Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.314507407
time=2024-06-05T09:34:30.987Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.564456516

ollama run mixtral:8x7b Error: llama runner process has terminated: exit status 127

dhiltgen commented 6 months ago

@om35 /usr/local/cuda/lib64/libcudart.so.11.0: file too short sounds like your local cuda install might be corrupt.

@abstract-entity it's a little hard to tell, but it seems like the subprocess is crashing. Can you try with debug enabled so we can see a little more detail?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

then trigger a model load, and assuming it crashes, share that server.log.

pamanseau commented 5 months ago

What I found out is that the NGINX ingress is causing this disconnection with the API, so, as you mentioned, Ollama stops loading the model and that causes this error.

If I connect the WebUI directly to the ClusterIP or NodePort, then it works.

I can configure the NGINX keepalive timeout via the ConfigMap, but then it applies to the whole cluster rather than just the Ollama ingress, so it has implications for the other services in the cluster.

The Helm chart deploys an Ingress, but it should look at the Gateway API or find another way to keep the client connected.
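One way to scope a longer timeout to the Ollama route only, instead of the cluster-wide ConfigMap, is per-Ingress annotations; a minimal sketch assuming the ingress-nginx controller (names are illustrative):

# excerpt of an Ingress manifest; only the annotations matter here
metadata:
  name: ollama
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"   # allow slow model loads
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"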

pulpocaminante commented 5 months ago

Hi, this doesn't happen to me when running ollama as root directly in a shell, but it happens when I start ollama as a service (regardless of the user):

amnesia λ ~/ sudo systemctl status ollama 
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-06-16 16:47:04 PDT; 39s ago
   Main PID: 7273 (ollama)
      Tasks: 18 (limit: 38365)
     Memory: 561.0M (peak: 615.8M)
        CPU: 5.644s
     CGroup: /system.slice/ollama.service
             └─7273 /usr/local/bin/ollama serve

Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.252-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.506-07:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sy>
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.511-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.714-07:00 level=WARN source=sched.go:511 msg="gpu VRAM usage didn't recover within timeout" seconds=5.045401852
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.749-07:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sy>
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.753-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.964-07:00 level=WARN source=sched.go:511 msg="gpu VRAM usage didn't recover within timeout" seconds=5.295260429
Jun 16 16:47:38 dead ollama[7273]: time=2024-06-16T16:47:38.007-07:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sy>
Jun 16 16:47:38 dead ollama[7273]: time=2024-06-16T16:47:38.012-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:38 dead ollama[7273]: time=2024-06-16T16:47:38.214-07:00 level=WARN source=sched.go:511 msg="gpu VRAM usage didn't recover within timeout" seconds=5.545797232

But somehow:

amnesia λ ~/ sudo ROCR_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION="10.3.0" OLLAMA_DEBUG=1 ollama serve

It works fine and I can chat without issue. Here's my service file; please note I have tried with both the ollama user and the root user (and the ollama user is properly configured, in the render and video groups):

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=root
Group=root
Restart=always
RestartSec=3
Environment="PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/local/lib/baresip/modules"
Environment="ROCR_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=\"10.3.0\""
[Install]
WantedBy=default.target

Both in the shell and when run as a service, they report using the same GPU (id=0, 6700XT):

level=INFO source=amd_linux.go:71 msg="inference compute" id=0 library=rocm compute=gfx1031 driver=0.0 name=1002:73df total="12.0 GiB" available="12.0 GiB"
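One detail worth checking: in the unit above, Environment="HSA_OVERRIDE_GFX_VERSION=\"10.3.0\"" makes systemd pass the quote characters as part of the value (the journal indeed shows HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""), so the override may not be recognized when run as a service. A hedged sketch of the unescaped form, assuming that is the cause:

[Service]
# outer quotes only; escaped inner quotes would become part of the value
Environment="ROCR_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"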

dhiltgen commented 5 months ago

@pulpocaminante the upcoming version 0.1.45 (rc2 currently available) will report the GPU env vars in the log at startup which should help you troubleshoot the settings to figure out which one isn't getting passed in as expected.

Tai-Pham-2002 commented 3 months ago

When I run ollama run aiden_lu/minicpm-v2.6:Q4_K_M I get this error: Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed

This is my log:

Aug 29 17:35:21 ai-stg ollama[777926]: time=2024-08-29T17:35:21.555Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed"
Aug 29 17:35:21 ai-stg ollama[777926]: [GIN] 2024/08/29 - 17:35:21 | 500 | 1.011699134s | 127.0.0.1 | POST "/api/chat"
Aug 29 17:35:26 ai-stg ollama[777926]: time=2024-08-29T17:35:26.718Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.163553423 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
Aug 29 17:35:27 ai-stg ollama[777926]: time=2024-08-29T17:35:27.072Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.51679798 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
Aug 29 17:35:27 ai-stg ollama[777926]: time=2024-08-29T17:35:27.424Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.869100664 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1

Help me, please.

Leon-Sander commented 2 months ago

@dhiltgen I have no problems on Linux but get this error on Windows. My application uses Ollama as the LLM server, and many users work on Windows and experience this error. I also added "keep_alive": -1 to the API request, but it didn't change the result. I am using the latest Docker image.

ollama-1 | 2024/09/23 10:28:22 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" ollama-1 | time=2024-09-23T10:28:22.189Z level=INFO source=images.go:753 msg="total blobs: 20" ollama-1 | time=2024-09-23T10:28:22.307Z level=INFO source=images.go:760 msg="total unused blobs removed: 0" ollama-1 | time=2024-09-23T10:28:22.411Z level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" ollama-1 | time=2024-09-23T10:28:22.416Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]" ollama-1 | time=2024-09-23T10:28:22.416Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs" ollama-1 | time=2024-09-23T10:28:23.178Z level=INFO source=types.go:107 msg="inference compute" id=GPU-3c819917-39e0-af79-6f9b-db1d227a2872 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3070" total="8.0 GiB" available="6.9 GiB" ollama-1 | time=2024-09-23T10:28:56.963Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-3c819917-39e0-af79-6f9b-db1d227a2872 parallel=4 available=7444889600 required="6.2 GiB" ollama-1 | time=2024-09-23T10:28:56.963Z level=INFO source=server.go:103 msg="system memory" total="7.7 GiB" free="6.4 GiB" free_swap="2.0 GiB" ollama-1 | time=2024-09-23T10:28:56.965Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB" ollama-1 | time=2024-09-23T10:28:56.969Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 41745"
ollama-1 | time=2024-09-23T10:28:56.970Z level=INFO source=sched.go:449 msg="loaded runners" count=1 ollama-1 | time=2024-09-23T10:28:56.970Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding" ollama-1 | time=2024-09-23T10:28:56.971Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" ollama-1 | INFO [main] build info | build=10 commit="eaf151c" tid="140164179275776" timestamp=1727087337 ollama-1 | INFO [main] system info | n_threads=6 n_threads_batch=6 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140164179275776" timestamp=1727087337 total_threads=12 ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="41745" tid="140164179275776" timestamp=1727087337 ollama-1 | time=2024-09-23T10:28:57.223Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" ollama-1 | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest)) ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama ollama-1 | llama_model_loader: - kv 1: general.type str = model ollama-1 | llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct ollama-1 | llama_model_loader: - kv 3: general.finetune str = Instruct ollama-1 | llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
ollama-1 | llama_model_loader: - kv 5: general.size_label str = 8B ollama-1 | llama_model_loader: - kv 6: general.license str = llama3.1 ollama-1 | llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... ollama-1 | llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... ollama-1 | llama_model_loader: - kv 9: llama.block_count u32 = 32 ollama-1 | llama_model_loader: - kv 10: llama.context_length u32 = 131072 ollama-1 | llama_model_loader: - kv 11: llama.embedding_length u32 = 4096 ollama-1 | llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336 ollama-1 | llama_model_loader: - kv 13: llama.attention.head_count u32 = 32 ollama-1 | llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 ollama-1 | llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 ollama-1 | llama_model_loader: - kv 17: general.file_type u32 = 2 ollama-1 | llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 ollama-1 | llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... ollama-1 | llama_model_loader: - kv 28: general.quantization_version u32 = 2 ollama-1 | llama_model_loader: - type f32: 66 tensors ollama-1 | llama_model_loader: - type q4_0: 225 tensors ollama-1 | llama_model_loader: - type q6_K: 1 tensors ollama-1 | llm_load_vocab: special tokens cache size = 256 ollama-1 | llm_load_vocab: token to piece cache size = 0.7999 MB ollama-1 | llm_load_print_meta: format = GGUF V3 (latest) ollama-1 | llm_load_print_meta: arch = llama ollama-1 | llm_load_print_meta: vocab type = BPE ollama-1 | llm_load_print_meta: n_vocab = 128256 ollama-1 | llm_load_print_meta: n_merges = 280147 ollama-1 | llm_load_print_meta: vocab_only = 0 ollama-1 | llm_load_print_meta: n_ctx_train = 131072 ollama-1 | llm_load_print_meta: n_embd = 4096 ollama-1 | llm_load_print_meta: n_layer = 32 ollama-1 | llm_load_print_meta: n_head = 32 ollama-1 | llm_load_print_meta: n_head_kv = 8 ollama-1 | llm_load_print_meta: n_rot = 128 ollama-1 | llm_load_print_meta: n_swa = 0 ollama-1 | llm_load_print_meta: n_embd_head_k = 128 ollama-1 | llm_load_print_meta: n_embd_head_v = 128 ollama-1 | llm_load_print_meta: n_gqa = 4 ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024 ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024 ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00 ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05 ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00 ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00 ollama-1 | llm_load_print_meta: n_ff = 14336 ollama-1 | llm_load_print_meta: n_expert = 0 ollama-1 | llm_load_print_meta: n_expert_used = 0 ollama-1 | llm_load_print_meta: causal attn = 1 ollama-1 | llm_load_print_meta: pooling type = 0 ollama-1 | llm_load_print_meta: rope type = 0 ollama-1 | llm_load_print_meta: rope scaling = linear ollama-1 | llm_load_print_meta: freq_base_train = 500000.0 ollama-1 | llm_load_print_meta: freq_scale_train = 1 ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 131072 ollama-1 | llm_load_print_meta: rope_finetuned = unknown ollama-1 | llm_load_print_meta: ssm_d_conv = 0 ollama-1 | llm_load_print_meta: ssm_d_inner = 0 ollama-1 | llm_load_print_meta: ssm_d_state = 0 ollama-1 | llm_load_print_meta: ssm_dt_rank = 0 ollama-1 | llm_load_print_meta: ssm_dt_b_c_rms = 0 ollama-1 | llm_load_print_meta: model type = 8B ollama-1 | llm_load_print_meta: model ftype = Q4_0 ollama-1 | llm_load_print_meta: model params = 8.03 B 
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) ollama-1 | llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>' ollama-1 | llm_load_print_meta: LF token = 128 'Ä' ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ollama-1 | llm_load_print_meta: max token length = 256 ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ollama-1 | ggml_cuda_init: found 1 CUDA devices: ollama-1 | Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes ollama-1 | llm_load_tensors: ggml ctx size = 0.27 MiB ollama-1 | time=2024-09-23T10:33:57.075Z level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - " ollama-1 | [GIN] 2024/09/23 - 10:33:57 | 500 | 5m0s | 172.18.0.3 | POST "/api/chat" app-1 | {'error': 'timed out waiting for llama runner to start - progress 0.00 - '} ollama-1 | time=2024-09-23T10:34:02.206Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.131559562 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | time=2024-09-23T10:34:02.455Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.380732173 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | time=2024-09-23T10:34:02.716Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.641832884 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe

dhiltgen commented 2 months ago

@Leon-Sander you can set OLLAMA_LOAD_TIMEOUT to adjust the timeout if your system needs more than 5m to load the model. If that doesn't get you past it, please open a new issue with your server logs so we can investigate.
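For the Docker image, that variable can be passed at container start; a minimal sketch (the volume and port flags mirror the standard run command):

# give slow model loads 10 minutes instead of the default 5
docker run -d --gpus=all -e OLLAMA_LOAD_TIMEOUT=10m \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama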

Leon-Sander commented 2 months ago

@dhiltgen got it, thanks.

Do you have an idea why loading/offloading to the GPU takes that much time on Windows? On Linux, llama3.1 is loaded in 10 seconds, but on Windows it takes me 5 minutes on the same computer. I have good hardware. Inference time seems to be pretty much the same as on Linux; it's just the model loading that is unbearable.

Edit: The answer seems to be the WSL2-based Docker setup: I/O back to the NTFS filesystem can be slow (#6006). I just tested it without Docker and loading is as fast as on Linux.
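If Docker on Windows is still needed, one mitigation consistent with #6006 is keeping the model store on the Linux filesystem inside the WSL2 VM (a named volume) rather than bind-mounting an NTFS path; a hedged sketch:

# a named volume lives on WSL2's ext4 disk, avoiding NTFS I/O on every model load
docker volume create ollama
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# by contrast, a Windows bind mount such as -v C:\ollama:/root/.ollama goes through the slow NTFS path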

JoffreyLemeryAncileo commented 1 week ago

@dhiltgen, thanks for all your activity!

For my company I am doing some testing to figure out which Azure instance type we are going to deploy our model on. I installed Ollama 0.4.1 on Ubuntu 22.04, NC64asT4v3 series.

I installed CUDA 12.6 and all the GPUs are available. Like many others, I updated my ollama.service like this to make it accessible by API call:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_LOG_LEVEL=debug"
Environment="OLLAMA_GPU=1"

[Install]
WantedBy=default.target

When I run gemma2:27b, for example, it is quite slow despite the 4 GPUs. I dug a bit and saw that only 4/47 layers are offloaded; CPU usage explodes and the GPUs are almost unused.

==> I duplicated the model with a Modelfile and controlled the offloading with PARAMETER num_gpu 47. I also set it inside gemma2:27b using /set parameter num_gpu 47.

It increases performance a bit. I now want to validate with API calls from my local laptop. However, no matter which model I call or which parameters I pass, it always offloads only 4/47 layers.

I understand from another ticket you replied to that Ollama aims to calculate the VRAM itself, which is strange because with the 4 GPUs I theoretically have enough VRAM to load all the layers and run them.

So I wonder if the calculation really works and whether there is a way to bypass it using the Python request payload. A point I noticed: the logs at the beginning always tell me that the GPU VRAM didn't recover.
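On bypassing it from a request payload: the API accepts an options object per request, so num_gpu can be sent directly; a minimal sketch with curl (values are illustrative, and note that the log below shows layers.requested=4, which suggests the value actually reaching the server is 4 rather than 47):

curl http://<server-ip>:11434/api/generate -d '{
  "model": "gemma2:27b",
  "prompt": "hello",
  "options": { "num_gpu": 47 }
}'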

LOGS : Nov 20 10:17:57 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:17:57.777Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.381064304 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:17:59 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:17:59.348Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=7.952129341 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:18:00 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:00.890Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=9.494299231 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:18:02 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:02.431Z level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc library=cuda parallel=1 required="46.3 GiB" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.039Z level=INFO source=server.go:105 msg="system memory" total="432.9 GiB" free="427.0 GiB" free_swap="0 B" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.040Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=4 layers.model=47 layers.offload=4 layers.split=1,1,1,1 memory.available="[14.5 GiB 14.5 GiB 14.5 GiB 14.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="99.8 GiB" memory.required.partial="46.3 GiB" memory.required.kv="44.9 GiB" memory.required.allocations="[11.6 GiB 11.6 GiB 11.6 GiB 11.6 GiB]" memory.weights.total="58.6 GiB" memory.weights.repeating="57.7 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="8.6 GiB" memory.graph.partial="8.6 GiB" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.042Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3490198133/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 128000 --batch-size 512 --n-gpu-layers 4 --threads 64 --parallel 1 --tensor-split 1,1,1,1 --port 45325" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.043Z level=INFO source=sched.go:449 msg="loaded runners" count=1 Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.043Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.044Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.069Z level=INFO source=runner.go:863 msg="starting go runner"

[...]

Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: ggml_cuda_init: found 4 CUDA devices: Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 0: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 1: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 2: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 3: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: ggml ctx size = 1.14 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: offloading 4 repeating layers to GPU Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: offloaded 4/47 layers to GPU Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CPU buffer size = 14898.60 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA0 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA1 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA2 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA3 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_ctx = 128000 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_batch = 512 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_ubatch = 512 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: flash_attn = 0 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: freq_base = 10000.0 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: freq_scale = 1 Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA_Host KV buffer size = 42000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA0 KV buffer size = 1000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA1 KV buffer size = 1000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA2 KV buffer size = 1000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA3 KV buffer size = 1000.00 MiB

Is my GPU config wrong, causing the VRAM not to be released, or is offloading 4/47 layers really the maximum, or is the automatic num_gpu calculation wrong? Or maybe all of them?

Cheers and thanks!