Closed: iplayfast closed this issue 8 months ago.
Hi, have you tried running systemctl restart ollama.service after each attempt?
Yes, that does clear the problem, but of course by then the program is borked. It isn't a good fix, if that's what you're suggesting, but it does reset ollama.
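The workaround amounts to something like the sketch below (a sketch, not my actual test code; it assumes a systemd install where the unit is named ollama.service and the script is allowed to call systemctl, possibly via sudo):

import subprocess
import time

import requests

def generate_with_restart(model, prompt):
    # Hypothetical helper: make one generate call, and if the server has
    # died ("Connection refused"), restart the unit and retry once.
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    try:
        return requests.post(url, json=payload, timeout=600).json()["response"]
    except requests.exceptions.ConnectionError:
        subprocess.run(["systemctl", "restart", "ollama.service"], check=True)
        time.sleep(10)  # give the service time to come back up
        return requests.post(url, json=payload, timeout=600).json()["response"]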
Thanks for reporting this @iplayfast. I think this could have been fixed in the most recent release. Please let me know if you're still seeing issues.
No, it still occurs. Some thoughts:
Version 0.1.20 did better, but my torture test still killed it.
python CreateNotes.py
mixtral:latest
notux:latest
dolphin-mixtral:latest
Guido:latest
alfred:latest
phind-codellama:latest
codebooga:latest
deepseek-coder:33b
nexusraven:latest
everythinglm:latest
orca2:13b
codeup:latest
wizardlm-uncensored:latest
eas/nous-hermes-2-solar-10.7b:latest
solar:latest
llama-pro:latest
bakllava:latest
llava:latest
falcon:latest
Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cdba10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf6f90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cd8b50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf7110>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cfe750>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cfe5d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf74d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9d01f10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cd8a10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf6b90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9d0d650>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf5990>: Failed to establish a new connection: [Errno 111] Connection refused'))
I became suspicious when, after testing again, it died on falcon again. So I tried falcon on its own. It died. I tried removing falcon and reinstalling it. It still died. The problem might be with falcon.
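That single-model check is easy to script; a minimal sketch via the ollama CLI (the prompt is arbitrary):

import subprocess

# Run falcon by itself and see whether the server survives the call.
out = subprocess.run(
    ["ollama", "run", "falcon:latest", "Say hello."],
    capture_output=True, text=True, timeout=300,
)
print(out.stdout or out.stderr)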
Could you capture server logs from around the time of the crash?
I just finished running it with version 0.1.22, and it made it much farther in the test. It now doesn't crash but seems to be stuck in some infinite loop. While the test was running I did a systemctl restart ollama and it carried on after missing a few questions. I've updated my stress test so that all the questions are asked first and evaluated afterwards, so there is less swapping of LLMs. The github repo (see above) has been updated with CreateNotes, ViewResults, and the results.json. The questions are asked from the largest model to the smallest, roughly as sketched below.
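A sketch of that reordering (not the repo's exact code; the sample questions are illustrative, and /api/tags reports each installed model's size in bytes):

import requests

BASE = "http://localhost:11434"
questions = ["what fills you with joy", "Why is the sky blue?"]  # illustrative

def ask(model, prompt):
    r = requests.post(f"{BASE}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=1500)
    return r.json()["response"]

# Sort largest-first and ask every question of one model before moving on,
# so each model is loaded once instead of being swapped in per question.
models = sorted(requests.get(f"{BASE}/api/tags").json()["models"],
                key=lambda m: m["size"], reverse=True)
answers = {m["name"]: {q: ask(m["name"], q) for q in questions} for m in models}
# Evaluation then happens in a second pass over `answers`.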
As for server logs, where would they be located? I can't find them.
My current models are:
ollama list
NAME ID SIZE MODIFIED
chris/openhermes-agent:latest c674d4614455 5.1 GB 10 days ago
eas/nous-hermes-2-solar-10.7b:latest 5986dba75154 6.5 GB 3 weeks ago
DrunkSally:latest 7b378c3757fc 3.8 GB 6 weeks ago
Guido:latest 158599e734fb 26 GB 6 weeks ago
Jim:latest 2c7476fb37de 3.8 GB 2 months ago
Mario:latest 902e3a8e5ed7 3.8 GB 2 months ago
MrT:latest e792712b8728 3.8 GB 6 weeks ago
Polly:latest 19982222ada1 4.1 GB 2 months ago
Sally:latest 903b51bbe623 3.8 GB 6 weeks ago
Ted:latest fdabf1286f32 4.1 GB 6 weeks ago
alfred:latest e46325710c52 23 GB 2 months ago
codebooga:latest 05b83c5673dc 19 GB 2 months ago
codellama:latest 8fdf8f752f6e 3.8 GB 2 months ago
codeup:latest 54289661f7a9 7.4 GB 2 months ago
deepseek-coder:33b acec7c0b0fd9 18 GB 3 weeks ago
deepseek-coder:latest 3ddd2d3fc8d2 776 MB 3 weeks ago
deepseek-llm:latest 9aab369a853b 4.0 GB 6 weeks ago
dolphin-mistral:latest ecbf896611f5 4.1 GB 2 weeks ago
dolphin-mixtral:latest cfada4ba31c7 26 GB 3 weeks ago
dolphin-phi:latest c5761fc77240 1.6 GB 5 weeks ago
duckdb-nsql:latest 7a42116386ac 3.8 GB 3 days ago
everythinglm:latest b005372bc34b 7.4 GB 3 weeks ago
llama-pro:latest fc5c0d744444 4.7 GB 2 weeks ago
llama2:13b d475bf4c50bc 7.4 GB 6 days ago
llama2:70b e7f6c06ffef4 38 GB 6 days ago
llama2:7b 78e26419b446 3.8 GB 6 days ago
llama2:latest 78e26419b446 3.8 GB 3 weeks ago
llama2-uncensored:latest 44040b922233 3.8 GB 2 months ago
llava:latest cd3274b81a85 4.5 GB 3 weeks ago
magicoder:latest 8007de06f5d9 3.8 GB 7 weeks ago
medllama2:latest a53737ec0c72 3.8 GB 2 months ago
mistral:7b 61e88e884507 4.1 GB 3 weeks ago
mistral:instruct 61e88e884507 4.1 GB 3 weeks ago
mistral:latest 61e88e884507 4.1 GB 3 weeks ago
mistral:text d19e34de4cb6 4.1 GB 3 weeks ago
mistrallite:latest 5393d4f5f262 4.1 GB 2 months ago
mixtral:latest 7708c059a8bb 26 GB 3 weeks ago
neural-chat:latest 89fa737d3b85 4.1 GB 3 weeks ago
nexusraven:latest 483a8282af74 7.4 GB 11 days ago
notus:latest 43c512e16786 4.1 GB 4 weeks ago
notux:latest fe14e7d66184 26 GB 4 weeks ago
nous-hermes2-mixtral:latest 599da8dce2c1 26 GB 13 days ago
nsfw:latest 328546e02f6f 13 GB 3 days ago
nsfwstoryteller:latest 328546e02f6f 13 GB 3 days ago
openhermes:latest 95477a2659b7 4.1 GB 4 weeks ago
openhermes-agent:latest 4d82cc75e3aa 5.1 GB 11 days ago
openhermes2.5-mistral:latest ca4cd4e8a562 4.1 GB 2 months ago
orca-mini:latest 2dbd9f439647 2.0 GB 6 days ago
orca2:13b a8dcfac3ac32 7.4 GB 2 months ago
orca2:latest ea98cc422de3 3.8 GB 7 weeks ago
phi:latest e2fd6321a5fe 1.6 GB 3 weeks ago
phind-codellama:latest 566e1b629c44 19 GB 3 weeks ago
qwen:latest 0fddaff90ef5 4.5 GB 6 days ago
samantha-mistral:latest f7c8c9be1da0 4.1 GB 2 months ago
solar:latest 059fdabbe6e6 6.1 GB 6 weeks ago
sqlcoder:latest 77ac14348387 4.1 GB 2 months ago
stable-code:latest aa5ab8afb862 1.6 GB 11 days ago
stablelm-zephyr:latest 0a108dbd846e 1.6 GB 3 weeks ago
stablelm2:latest ea04e74d6b59 982 MB 3 days ago
starling-lm:latest ff4752739ae4 4.1 GB 3 weeks ago
tinydolphin:latest 97c9685cc5db 636 MB 3 days ago
tinyllama:latest 2644915ede35 637 MB 3 weeks ago
wizard-math:latest 5ab8dc2115d3 4.1 GB 5 weeks ago
wizard-vicuna-uncensored:7b 72fc3c2b99dc 3.8 GB 6 weeks ago
wizard-vicuna-uncensored:latest 72fc3c2b99dc 3.8 GB 2 months ago
wizardcoder:latest de9d848c1323 3.8 GB 4 weeks ago
wizardlm-uncensored:latest 886a369d74fc 7.4 GB 7 weeks ago
xwinlm:latest 0fa68068d970 3.8 GB 2 months ago
yarn-mistral:latest 8e9c368a0ae4 4.1 GB 6 weeks ago
yi:latest a86526842143 3.5 GB 3 weeks ago
zephyr:latest bbe38b81adec 4.1 GB 3 weeks ago
It seemed that sqlcoder started having problems, answering questions in strange ways. The results.json file can be searched for ": No Answer due to error".
The question "what fills you with joy" running from just the command line seemed to give a very long answer. and my software failed here, I restarted the server after several hours. Perhaps that's why as it's a code completion model.
Given that, Code completion models are so different than chat models there should be a way that:
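One possible handling (a sketch, not anything the harness does today): cap generation length for completion-style models via the num_predict option, so one runaway answer can't stall the run for hours. The model set here is illustrative.

import requests

COMPLETION_MODELS = {"sqlcoder:latest", "stable-code:latest"}  # illustrative

def ask_capped(model, prompt):
    # Limit completion-style models to a few hundred tokens per answer.
    options = {"num_predict": 256} if model in COMPLETION_MODELS else {}
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt,
                            "stream": False, "options": options},
                      timeout=600)
    return r.json()["response"]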
As for server logs, where would they be located, as I can't find them?
Depends on your platform. Check out https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md
Yikes, that's a lot of data. Are you looking for anything in particular? I've included a small sample from around the time. (Note to self: journalctl -u ollama -S "2024-01-30 17:01:45")
Jan 29 03:14:33 FORGE ollama[2004316]: [GIN] 2024/01/29 - 03:14:33 | 200 | 208.013µs | 127.0.0.1 | POST "/api/show"
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 cpu_common.go:11: INFO CPU has AVX2
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 dyn_ext_server.go:90: INFO Loading Dynamic llm server: /tmp/ollama4251586406/cuda_v11/libext_server.>
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 dyn_ext_server.go:145: INFO Initializing llama server
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 0: general.architecture str = llama
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 1: general.name str = teknium
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 2: llama.context_length u32 = 32768
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 11: general.file_type u32 = 2
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.00000>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, >
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv 21: general.quantization_version u32 = 2
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - type f32: 65 tensors
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - type q4_0: 225 tensors
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - type q6_K: 1 tensors
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: format = GGUF V3 (latest)
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: arch = llama
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: vocab type = SPM
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_vocab = 32002
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_merges = 0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_ctx_train = 32768
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd = 4096
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_head = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_head_kv = 8
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_layer = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_rot = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_head_k = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_head_v = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_gqa = 4
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_k_gqa = 1024
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_v_gqa = 1024
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_ff = 14336
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_expert = 0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_expert_used = 0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: rope scaling = linear
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: freq_base_train = 10000.0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: freq_scale_train = 1
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: rope_finetuned = unknown
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model type = 7B
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model ftype = Q4_0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model params = 7.24 B
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: general.name = teknium
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: BOS token = 1 '<s>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: EOS token = 32000 '<|im_end|>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: UNK token = 0 '<unk>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: LF token = 13 '<0x0A>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_tensors: ggml ctx size = 0.22 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: CPU buffer size = 70.32 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: CUDA0 buffer size = 3847.56 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: ...................................................................................................
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: n_ctx = 2048
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: freq_base = 10000.0
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: freq_scale = 1
Jan 29 03:14:35 FORGE ollama[2004316]: llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: CUDA_Host input buffer size = 12.01 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: CUDA0 compute buffer size = 156.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: graph splits (measure): 3
Jan 29 03:14:35 FORGE ollama[2004316]: 2024/01/29 03:14:35 dyn_ext_server.go:156: INFO Starting llama main loop
Jan 29 03:14:35 FORGE ollama[2004316]: [GIN] 2024/01/29 - 03:14:35 | 200 | 2.247827969s | 127.0.0.1 | POST "/api/chat"
Jan 29 03:14:56 FORGE ollama[2004316]: 2024/01/29 03:14:56 dyn_ext_server.go:170: INFO loaded 0 images
Jan 29 03:14:57 FORGE ollama[2004316]: [GIN] 2024/01/29 - 03:14:57 | 200 | 358.002761ms | 127.0.0.1 | POST "/api/chat"
Jan 29 03:15:38 FORGE ollama[2004316]: 2024/01/29 03:15:38 dyn_ext_server.go:170: INFO loaded 0 images
Here is the function that eventually fails:
import concurrent.futures
import time

def get_answer(ollama, question, timeout=1000):
    """Get an answer from the Ollama model with a timeout."""
    start_time = time.time()
    # Don't use `with ThreadPoolExecutor()` here: on exit it calls
    # shutdown(wait=True) and blocks until the hung call returns, which
    # defeats the timeout. Shut down explicitly without waiting instead.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(ollama, question)
    try:
        result = future.result(timeout=timeout).strip()
    except concurrent.futures.TimeoutError:
        print(f"Timed out after {timeout} seconds for question: {question}")
        result = 'No Answer due to timeout'
    except Exception as e:
        print(f"Error: {e}")
        result = 'No Answer due to error'
    finally:
        executor.shutdown(wait=False)
    elapsed_time = time.time() - start_time
    return result, elapsed_time
# Usage in your loop remains the same
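For completeness, a usage sketch; ask_model here is a hypothetical stand-in for the callable the harness actually passes as ollama:

import requests

def ask_model(question):
    # Stand-in: one non-streaming generate call against a single model.
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "sqlcoder:latest", "prompt": question,
                            "stream": False},
                      timeout=1500)
    r.raise_for_status()
    return r.json()["response"]

answer, seconds = get_answer(ask_model, "what fills you with joy", timeout=1500)
print(f"{seconds:.1f}s: {answer[:80]}")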
Here is the log at the time of the timeout (after 1500 seconds):
Jan 30 20:46:10 FORGE ollama[3131650]: 2024/01/30 20:46:10 dyn_ext_server.go:145: INFO Initializing llama server
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:4a3019290402c9eadf89a3bf793102a52a2a44dd76ea7b07fca53f9cbb789a63 (version GGUF V2)
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 0: general.architecture str = llama
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 1: general.name str = ehartford
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 2: llama.context_length u32 = 32768
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 11: general.file_type u32 = 2
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv 18: general.quantization_version u32 = 2
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - type f32: 65 tensors
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - type q4_0: 225 tensors
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - type q6_K: 1 tensors
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: format = GGUF V2
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: arch = llama
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: vocab type = SPM
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_vocab = 32002
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_merges = 0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_ctx_train = 32768
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd = 4096
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_head = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_head_kv = 8
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_layer = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_rot = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_head_k = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_head_v = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_gqa = 4
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_k_gqa = 1024
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_v_gqa = 1024
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_ff = 14336
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_expert = 0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_expert_used = 0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: rope scaling = linear
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: freq_base_train = 10000.0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: freq_scale_train = 1
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: rope_finetuned = unknown
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model type = 7B
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model ftype = Q4_0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model params = 7.24 B
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: general.name = ehartford
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: BOS token = 1 '<s>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: EOS token = 32000 '<|im_end|>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: UNK token = 0 '<unk>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: LF token = 13 '<0x0A>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: ggml ctx size = 0.22 MiB
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: CPU buffer size = 70.32 MiB
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: CUDA0 buffer size = 3847.56 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: ..................................................................................................
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: n_ctx = 2048
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: freq_base = 10000.0
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: freq_scale = 1
Jan 30 20:46:11 FORGE ollama[3131650]: llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: CUDA_Host input buffer size = 12.01 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: CUDA0 compute buffer size = 156.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: graph splits (measure): 3
Jan 30 20:46:11 FORGE ollama[3131650]: 2024/01/30 20:46:11 dyn_ext_server.go:156: INFO Starting llama main loop
Jan 30 20:46:11 FORGE ollama[3131650]: 2024/01/30 20:46:11 dyn_ext_server.go:170: INFO loaded 0 images
This should be resolved by #3218
I feel this is a major bug, as anyone using ollama for an extended time across several models will hit the same issue.
I'm using https://github.com/iplayfast/OllamaPlayground/tree/main/createnotes#readme which tests all the models on your system. It initially loads each model and says hello just to test; this is where the problem lies (roughly the loop sketched below).
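That warm-up pass is roughly the following sketch (model enumeration via the /api/tags endpoint):

import requests

# Load each installed model in turn and say hello, just to confirm it loads.
for m in requests.get("http://localhost:11434/api/tags").json()["models"]:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": m["name"], "prompt": "hello",
                            "stream": False})
    print(m["name"], "->", r.status_code)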
ollama serve
Error: listen tcp 127.0.0.1:11434: bind: address already in use
These are my models:
This is the output after loading them one after another:
sqlcoder isn't a big model. I had originally thought meditron was the problem, so I removed it, and the failure just moved to the next model. mixtralcpu is from https://ollama.ai/chris/mixtralcpu, which loads into system memory instead of onto the GPU. (It loaded fine from the command line.)