Closed · hcr707305003 closed this issue 3 months ago
This is my building_qwen_7b_gguf.Modelfile:
FROM test_quantize-q8_0.gguf
# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20
TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""
# set the system message
SYSTEM """
You are a helpful assistant.
"""
I have a similar setup: I'm running a fine-tuned model. The conversion step is as follows:
python convert-hf-to-gguf.py /content/LLaMA-Factory/p1 --outfile /content/drive/MyDrive/model/qwen1_5-1.8b-chat-fp16.gguf
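A typical intermediate step that isn't shown above (file names are illustrative, and older llama.cpp builds name the binary quantize instead of llama-quantize) would be quantizing the fp16 GGUF and then building the Ollama model from it:

./llama-quantize /content/drive/MyDrive/model/qwen1_5-1.8b-chat-fp16.gguf /content/drive/MyDrive/model/qwen1_5-1.8b-chat-q8_0.gguf q8_0
ollama create qwen_p -f Modelfile    # the Modelfile's FROM line should point at the quantized .gguf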
Run step:
ollama run qwen_p
Error: llama runner process has terminated: exit status 0xc0000409
logs:
time=2024-05-15T17:57:25.717+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
time=2024-05-15T17:57:26.113+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server error"
time=2024-05-15T17:57:26.371+08:00 level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/05/15 - 17:57:26 | 500 | 3.2125605s | 127.0.0.1 | POST "/api/chat"
Each time I try to load these models, I get the same error. Could anyone provide a fix? Thank you in advance :)
me too
D:\llama.cpp>ollama create eduaigc -f modelfile
transferring model data
using existing layer sha256:28ce318a0cda9dac3b5561c944c16c7e966b07890bed5bb12e122646bc8d71c4
creating new layer sha256:58353639a7c4b7529da8c5c8a63e81c426f206bab10cf82e4b9e427f15a466f8
creating new layer sha256:1da117d6723df114af0d948b614cae0aa684875e2775ca9607d23e2e0769651d
creating new layer sha256:9297f08dd6c6435240b5cddc93261e8a159aa0fecf010de4568ec2df2417bdb2
creating new layer sha256:14d7a26fe5b8e2168e038646c5fb6b0048e27c33628abda8d92ebfed0f369b9f
writing manifest
success
D:\llama.cpp>ollama run eduaigc
Error: llama runner process has terminated: exit status 0xc0000409
modelfile
FROM eduaigc-Q4_0.gguf
# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20
TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""
# set the system message
SYSTEM """
You are a helpful assistant.
"""
I have the same problem. I used LLaMA-Factory to train a LoRA and exported the merged model with this config:
model_name_or_path: /hy-tmp/model/qwen/Qwen1___5-7B-Chat
adapter_name_or_path: /hy-tmp/model/checkpoint7
template: qwen
finetuning_type: lora
export_dir: /hy-tmp/qwen7
export_size: 2
export_device: cpu
export_legacy_format: false
Then I used llama.cpp to convert it: python convert-hf-to-gguf.py /hy-tmp/qwen7
it works: (info_extra) root@7eff8c7865f0:~/project/llama.cpp# ./main -m /hy-tmp/qwen7/ggml-model-f16.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt Log start main: build = 2887 (583fd6b0) main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu main: seed = 1716201079 llama_model_loader: loaded meta data with 21 key-value pairs and 387 tensors from /hy-tmp/qwen7/ggml-model-f16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = qwen7 llama_model_loader: - kv 2: qwen2.block_count u32 = 32 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 4096 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 11008 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 32 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 32 llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 19: tokenizer.chat_template str = {% set system_message = 'You are a he... llama_model_loader: - kv 20: general.quantization_version u32 = 2 llama_model_loader: - type f32: 161 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: special tokens definition check successful ( 293/151936 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 11008 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 7.72 B llm_load_print_meta: model size = 14.38 GiB (16.00 BPW) llm_load_print_meta: general.name = qwen7 llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.18 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/33 layers to GPU llm_load_tensors: CPU buffer size = 14728.52 MiB ...................................................................................... llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1491.75 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB llama_new_context_with_model: graph nodes = 1126 llama_new_context_with_model: graph splits = 452
system_info: n_threads = 8 / 24 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | main: interactive mode on. Reverse prompt: '<|im_start|>user ' sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 512, n_batch = 2048, n_predict = 512, n_keep = 11
== Running in interactive mode. ==
<|endoftext|><|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user
你好啊 你好！有什么问题或需要帮助吗？<|im_end|> (user: "Hi there"; assistant: "Hello! Do you have any questions or need help?")
But the same problem occurs in Ollama:
time=2024-05-20T10:41:19.127Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T10:41:19.379Z level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/05/20 - 10:41:19 | 500 | 1.463489615s | 127.0.0.1 | POST "/api/chat"
time=2024-05-20T10:41:24.584Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.205149643
time=2024-05-20T10:41:24.863Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.484359346
time=2024-05-20T10:41:25.142Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.762816273
I have the same error:
ollama run qwen2:0.5b
Error: llama runner process has terminated: exit status 0xc0000409
My Ollama version was v0.1.38, but when I upgraded to v0.1.42 the problem was solved and the LLM now runs successfully.
I just tried ollama run hhao/openbmb-minicpm-llama3-v-2_5 with no other configuration. Windows 11, CPU, Ollama v0.1.42 - I'm getting the same error.
Try downloading and installing it again.
Thanks, I thought about restarting the PC, but didn't think of reinstalling the model. I tried ollama rm and then ran it again, but unfortunately it's still the same error:
ollama run hhao/openbmb-minicpm-llama3-v-2_5
pulling manifest
pulling 391d11736c3c... 100% ▕████████████████▏ 1.0 GB
pulling 010ec3ba94cb... 100% ▕████████████████▏ 4.9 GB
pulling 8ab4849b038c... 100% ▕████████████████▏ 254 B
pulling 2c527a8fcba5... 100% ▕████████████████▏ 124 B
pulling ada64ec88682... 100% ▕████████████████▏ 493 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Error: llama runner process has terminated: exit status 0xc0000409
ollama run qwen:1.8b
Error: llama runner process has terminated: exit status 0xc0000409 CUDA error"
time=2024-06-12T09:10:33.042+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=25 memory.available="11.0 GiB" memory.required.full="2.0 GiB" memory.required.partial="2.0 GiB" memory.required.kv="384.0 MiB" memory.weights.total="895.7 MiB" memory.weights.repeating="652.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="300.8 MiB" memory.graph.partial="544.2 MiB" time=2024-06-12T09:10:33.042+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=25 memory.available="11.0 GiB" memory.required.full="2.0 GiB" memory.required.partial="2.0 GiB" memory.required.kv="384.0 MiB" memory.weights.total="895.7 MiB" memory.weights.repeating="652.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="300.8 MiB" memory.graph.partial="544.2 MiB" time=2024-06-12T09:10:33.048+08:00 level=INFO source=server.go:341 msg="starting llama server" cmd="C:\Users\Administrator\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\Administrator\.ollama\models\blobs\sha256-1296b084ed6bc4c6eaee99255d73e9c715d38e0087b6467fd1c498b908180614 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 63328" time=2024-06-12T09:10:33.052+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-12T09:10:33.052+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-12T09:10:33.053+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3051 commit="5921b8f0" tid="21596" timestamp=1718154633 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="21596" timestamp=1718154633 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="63328" tid="21596" timestamp=1718154633 llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\Administrator.ollama\models\blobs\sha256-1296b084ed6bc4c6eaee99255d73e9c715d38e0087b6467fd1c498b908180614 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = Qwen2-beta-1_8B-Chat llama_model_loader: - kv 2: qwen2.block_count u32 = 24 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 2048 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 5504 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16 llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643 llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 17: tokenizer.chattemplate str = {% for message in messages %}{{'<|im... llama_model_loader: - kv 18: general.quantization_version u32 = 2 llama_model_loader: - kv 19: general.file_type u32 = 2 llama_model_loader: - type f32: 121 tensors llama_model_loader: - type q4_0: 169 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-06-12T09:10:33.317+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default' llm_load_vocab: special tokens cache size = 293 llm_load_vocab: token to piece cache size = 1.8676 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_layer = 24 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 5504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 1B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 1.84 B llm_load_print_meta: model size = 1.04 GiB (4.85 BPW) llm_load_print_meta: general.name = Qwen2-beta-1_8B-Chat llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151643 '<|endoftext|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.28 MiB llm_load_tensors: offloading 24 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 25/25 layers to GPU llm_load_tensors: CPU buffer size = 166.92 MiB llm_load_tensors: CUDA0 buffer size = 895.75 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: 
flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB llama_new_context_with_model: CUDA0 compute buffer size = 300.75 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.01 MiB llama_new_context_with_model: graph nodes = 846 llama_new_context_with_model: graph splits = 2 fatal : Memory allocation failure CUDA error: CUBLAS_STATUS_NOT_INITIALIZED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:653 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error" time=2024-06-12T09:10:39.665+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" time=2024-06-12T09:10:39.923+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error\"" [GIN] 2024/06/12 - 09:10:39 | 500 | 9.0390595s | 127.0.0.1 | POST "/api/chat"
Before I went to sleep it was working fine, but today it has gone bad again.
Got updated to 0.1.43 - still the same error. As per DHclly, it seems it's not only on CPU. (Now I'll be afraid to go to sleep when it works!)
It's amazing: after an hour I restarted it and it ran very well on the NVIDIA GPU. It's running successfully now, but I don't know why.
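A restart like that often helps simply because it releases VRAM that a crashed runner left allocated. A quick, illustrative way to check for that before retrying (assumes the NVIDIA driver tools are installed; the process name is the one shown in the logs above):

REM lists per-process VRAM usage; look for a stale ollama_llama_server.exe entry
nvidia-smi
REM Windows: kill a stuck runner if one is still listed
taskkill /IM ollama_llama_server.exe /F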
Me too:
C:\Users\LI>ollama run llama3
pulling manifest
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕████████████████▏ 12 KB
pulling 8ab4849b038c... 100% ▕████████████████▏ 254 B
pulling 577073ffcc6c... 100% ▕████████████████▏ 110 B
pulling 3f8eb4da87fa... 100% ▕████████████████▏ 485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Error: llama runner process has terminated: exit status 0xc0000409 error:failed to create context with model 'C:\Users\LI.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa'

C:\Users\LI>ollama run llama3
Error: llama runner process has terminated: exit status 0xc0000409 error:failed to create context with model 'C:\Users\LI.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa'
Unfortunately the exit code 0xc0000409 just indicates something went wrong. It looks like there are multiple unrelated topics in this issue.
For people trying to use qwen, please make sure to upgrade to the latest version, as fixes have gone in over the past few releases which should hopefully resolve those.
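For anyone unsure whether their setup already includes the qwen2 pre-tokenizer support, a quick sanity check might look like this (the gguf-dump script ships with llama.cpp's gguf-py package; its exact name and path vary between llama.cpp versions, and the model path is illustrative):

ollama --version
python llama.cpp/gguf-py/scripts/gguf-dump.py path\to\model.gguf | findstr tokenizer.ggml.pre

If the metadata shows tokenizer.ggml.pre = 'qwen2' and loading still fails with "unknown pre-tokenizer type", the runner predates that tokenizer and upgrading Ollama should resolve it.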
For people trying to create their own models which are causing the server to crash, please share your server log which may help understand which property/parameter caused the failure.
For the Memory allocation failure, please make sure you're running the latest version, and if that doesn't clear it, please share your server log.
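On Windows the server log referred to above normally lives under %LOCALAPPDATA%\Ollama; a sketch of how to grab it (default install locations, adjust if yours differ), plus enabling more verbose logging before reproducing the crash:

REM open the log folder, or print the current server log
explorer %LOCALAPPDATA%\Ollama
type "%LOCALAPPDATA%\Ollama\server.log"
REM optional: restart Ollama with debug logging enabled, then reproduce the error
set OLLAMA_DEBUG=1
ollama serve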
same error here. @dhiltgen
Here are my logs:
2024/07/05 14:14:20 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\Users\DELL\.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-07-05T14:14:20.644+02:00 level=INFO source=images.go:730 msg="total blobs: 0" time=2024-07-05T14:14:20.644+02:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0" time=2024-07-05T14:14:20.645+02:00 level=INFO source=routes.go:1111 msg="Listening on 127.0.0.1:11434 (version 0.1.48)" time=2024-07-05T14:14:20.645+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v5.7 cpu cpu_avx]" time=2024-07-05T14:14:21.533+02:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-15161996-1a7c-8143-bc65-810c3bf997fb library=cuda compute=7.5 driver=0.0 name="" total="6.0 GiB" available="5.0 GiB" [GIN] 2024/07/05 - 14:14:33 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/07/05 - 14:14:33 | 404 | 575.7µs | 127.0.0.1 | POST "/api/show" time=2024-07-05T14:14:35.466+02:00 level=INFO source=download.go:136 msg="downloading 6a0746a1ec1a in 47 100 MB part(s)" time=2024-07-05T14:17:40.917+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 11 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:17:45.920+02:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 11 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection." 
time=2024-07-05T14:18:16.571+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 16 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:18:48.245+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 7 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:18:56.308+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 23 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:18:59.772+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 9 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:00.704+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 29 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:15.866+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 5 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:21.075+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 19 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:31.399+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 33 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:37.085+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 20 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:42.827+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 45 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:49.355+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 26 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:04.830+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 34 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:21.486+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 37 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:31.388+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 4 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:34.714+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 2 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:43.434+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 38 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:50.118+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 44 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:21:27.716+02:00 level=INFO source=download.go:136 msg="downloading 4fa551d4f938 in 1 12 KB part(s)" time=2024-07-05T14:21:29.588+02:00 level=INFO source=download.go:136 msg="downloading 8ab4849b038c in 1 254 B part(s)" time=2024-07-05T14:21:31.512+02:00 level=INFO source=download.go:136 msg="downloading 577073ffcc6c in 1 110 B part(s)" time=2024-07-05T14:21:33.288+02:00 level=INFO source=download.go:136 msg="downloading 3f8eb4da87fa in 1 485 B part(s)" [GIN] 2024/07/05 - 14:21:44 | 200 | 7m11s | 127.0.0.1 | POST "/api/pull" [GIN] 2024/07/05 - 14:21:44 | 200 | 17.08ms | 127.0.0.1 | POST "/api/show" time=2024-07-05T14:21:44.703+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[5.0 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" 
memory.graph.partial="677.5 MiB" time=2024-07-05T14:21:44.706+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 59978" time=2024-07-05T14:21:44.730+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1 time=2024-07-05T14:21:44.730+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding" time=2024-07-05T14:21:44.730+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3171 commit="7c26775a" tid="10184" timestamp=1720182105 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="10184" timestamp=1720182105 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="59978" tid="10184" timestamp=1720182105 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe time=2024-07-05T14:21:45.247+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="10184" timestamp=1720182107 time=2024-07-05T14:21:48.178+02:00 level=INFO source=server.go:599 msg="llama runner started in 3.45 seconds" [GIN] 
2024/07/05 - 14:21:48 | 200 | 3.5127842s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/07/05 - 14:22:07 | 200 | 11.0666417s | 127.0.0.1 | POST "/api/chat" time=2024-07-05T14:26:05.227+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="487.5 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="283.4 MiB" memory.graph.partial="677.5 MiB" time=2024-07-05T14:26:05.230+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 60034" time=2024-07-05T14:26:05.233+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1 time=2024-07-05T14:26:05.233+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding" time=2024-07-05T14:26:05.234+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3171 commit="7c26775a" tid="23620" timestamp=1720182365 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="23620" timestamp=1720182365 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="60034" tid="23620" timestamp=1720182365 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-07-05T14:26:05.485+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 3904 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 488.00 MiB llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB 
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB llama_new_context_with_model: CUDA0 compute buffer size = 283.63 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 15.63 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 CUDA error: CUBLAS_STATUS_ALLOC_FAILED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:826 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error" time=2024-07-05T14:26:08.104+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error\"" [GIN] 2024/07/05 - 14:26:08 | 500 | 3.2221016s | 127.0.0.1 | POST "/api/chat" time=2024-07-05T14:26:13.138+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0337086 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T14:26:13.386+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2815106 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T14:26:13.634+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5300006 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa [GIN] 2024/07/05 - 14:30:22 | 404 | 0s | 127.0.0.1 | GET "/api/chat" [GIN] 2024/07/05 - 14:31:49 | 200 | 0s | 127.0.0.1 | GET "/" [GIN] 2024/07/05 - 14:32:21 | 404 | 0s | 127.0.0.1 | GET "/api/chat" time=2024-07-05T15:59:52.039+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="487.5 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="283.4 MiB" memory.graph.partial="677.5 MiB" time=2024-07-05T15:59:52.044+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 60942" time=2024-07-05T15:59:52.073+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1 time=2024-07-05T15:59:52.073+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding" time=2024-07-05T15:59:52.074+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3171 commit="7c26775a" tid="11152" timestamp=1720187992 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="11152" timestamp=1720187992 total_threads=12 INFO [wmain] HTTP server listening | 
hostname="127.0.0.1" n_threads_http="11" port="60942" tid="11152" timestamp=1720187992 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe time=2024-07-05T15:59:52.846+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 3904 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 488.00 MiB llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB llama_new_context_with_model: CUDA0 compute buffer size = 283.63 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 15.63 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 CUDA error: CUBLAS_STATUS_ALLOC_FAILED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:826 
cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error" time=2024-07-05T15:59:57.275+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" time=2024-07-05T15:59:57.538+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error\"" [GIN] 2024/07/05 - 15:59:57 | 500 | 5.5825397s | 127.0.0.1 | POST "/api/chat" time=2024-07-05T16:00:02.560+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.021286 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T16:00:02.811+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2725836 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T16:00:03.062+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5234575 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
@someone2018 your error looks like an OOM problem. We failed to partially load with 5 GB available on the GPU. Please make sure to update to the latest version, and if you're still hitting the OOM crash, please let us know which model you were trying to load.
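If the latest version still runs out of VRAM on a ~6 GB card, a common workaround is to offload fewer layers to the GPU or shrink the context so the KV cache is smaller. Both are standard Ollama parameters; the values below are illustrative, not tuned:

ollama run llama3
>>> /set parameter num_gpu 24
>>> /set parameter num_ctx 2048

The same settings can be made persistent with PARAMETER num_gpu and PARAMETER num_ctx lines in a Modelfile.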
Okay, I'm trying it out. The model I'm using is llama3
I'm going to close this one out. We should now detect most failures and report a better error message than 0xc0000409, and folks can find other similar issues to +1, or open new ones.
What is the issue?
When I run the quantized model on v0.1.37, it errors out:
Error: llama runner process has terminated: exit status 0xc0000409
first step:
second step:
OS
Windows
GPU
Intel
CPU
Intel
Ollama version
v0.1.37