likelovewant / ROCmLibs-for-gfx1103-AMD780M-APU

ROCm library files for gfx1103, updated with other architectures for AMD GPUs, for use on Windows.
GNU General Public License v3.0

RX 6600 runs llama / qwen slowly, VRAM usage is abnormal #13

Closed · wsadaaa closed this issue 4 weeks ago

wsadaaa commented 1 month ago

Environment: ROCm 6.1.2, ollama 0.3.13, Win 11, RX 6600, i7-11700K, 32 GB RAM

I have replaced both the ROCm library folder and the DLL.

ollama ps shows 100% GPU, but Task Manager shows only modest VRAM usage and much higher system RAM usage, as if the GPU is not running at full capacity.

Another machine with an RX 5500 actually runs a bit faster than this 6600.

Please help!

(Screenshots 01 and 02 were attached here.)

likelovewant commented 1 month ago

Possible cause: ollama detected that the GPU was already partly in use, so it automatically sized the VRAM allocation for you; some other application may be occupying the GPU. A reboot usually resolves this, or you can set num_thread / num_gpu manually to test. See https://github.com/ollama/ollama/issues/2496 and https://github.com/ollama/ollama/issues/6008
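As a concrete way to test this (the model name and layer count below are only examples, not values from this thread), num_gpu and num_thread can be passed as per-request options to the ollama API:

```powershell
# Example only: pass num_gpu / num_thread as per-request options to see whether
# a manual setting changes the VRAM/RAM split (model name is illustrative).
$body = @{
    model   = "llama3.1:8b"
    prompt  = "hello"
    stream  = $false
    options = @{ num_gpu = 33; num_thread = 8 }
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://127.0.0.1:11434/api/generate" -Method Post `
    -ContentType "application/json" -Body $body
```

The same options can also be tried interactively with /set parameter num_gpu 33 inside an ollama run session.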

wsadaaa commented 1 month ago

I spent a whole day on this and still have not solved it. I suspected some setting in open webui or AnythingLLM might be the cause, but even after resetting both to defaults and then uninstalling them it still does not work. Right now llama 3.2 3b runs entirely in VRAM and the speed is normal, while models the size of llama 3.1 8b or qwen 2.5 7.6b use a large amount of system RAM and only about half of the VRAM, which makes them very slow. Here is the ollama log:
2024/10/21 07:25:09 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\AI\ollama models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-10-21T07:25:09.238+08:00 level=INFO source=images.go:754 msg="total blobs: 33" time=2024-10-21T07:25:09.258+08:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0" time=2024-10-21T07:25:09.261+08:00 level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11434 (version 0.3.13-1-geec4cd6)" time=2024-10-21T07:25:09.262+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]" time=2024-10-21T07:25:09.262+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" time=2024-10-21T07:25:09.519+08:00 level=INFO source=gpu.go:252 msg="error looking up nvidia GPU memory" error="cuda driver library failed to get device context 801" time=2024-10-21T07:25:10.018+08:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 driver=6.2 name="AMD Radeon RX 6600" total="8.0 GiB" available="7.8 GiB" [GIN] 2024/10/21 - 07:25:10 | 200 | 34.8509ms | 127.0.0.1 | GET "/api/tags" [GIN] 2024/10/21 - 07:25:11 | 200 | 1.5668ms | 127.0.0.1 | GET "/api/tags" [GIN] 2024/10/21 - 07:25:11 | 200 | 0s | 127.0.0.1 | GET "/api/version" time=2024-10-21T07:25:23.219+08:00 level=INFO source=sched.go:185 msg="one or more GPUs detected that are unable to accurately report free memory - disabling default concurrency" time=2024-10-21T07:25:23.246+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model="D:\AI\ollama models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff" gpu=0 parallel=4 available=8278704128 required="3.7 GiB" time=2024-10-21T07:25:23.246+08:00 level=INFO source=server.go:108 msg="system memory" total="31.9 GiB" free="15.3 GiB" free_swap="14.3 GiB" time=2024-10-21T07:25:23.246+08:00 level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[7.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB" time=2024-10-21T07:25:23.258+08:00 level=INFO source=server.go:399 msg="starting llama server" 
cmd="C:\Users\CP\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe --model D:\AI\ollama models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 4 --port 8675" time=2024-10-21T07:25:23.277+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-10-21T07:25:23.277+08:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding" time=2024-10-21T07:25:23.277+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error" INFO [wmain] starting c++ runner | tid="7716" timestamp=1729466723 INFO [wmain] build info | build=3670 commit="88c682cf" tid="7716" timestamp=1729466723 INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="7716" timestamp=1729466723 total_threads=16 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="8675" tid="7716" timestamp=1729466723 llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from D:\AI\ollama models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 58 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors time=2024-10-21T07:25:23.530+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_head = 24 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 3 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8192 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 3.21 B llm_load_print_meta: model size = 1.87 GiB (5.01 BPW) llm_load_print_meta: general.name = Llama 3.2 3B Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6600, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.24 MiB llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: ROCm0 buffer size = 1918.36 MiB llm_load_tensors: CPU buffer size = 308.23 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 896.00 MiB llama_new_context_with_model: KV self 
size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 2.00 MiB llama_new_context_with_model: ROCm0 compute buffer size = 424.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 22.01 MiB llama_new_context_with_model: graph nodes = 902 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="7716" timestamp=1729466728 time=2024-10-21T07:25:28.205+08:00 level=INFO source=server.go:637 msg="llama runner started in 4.93 seconds" [GIN] 2024/10/21 - 07:25:32 | 200 | 9.8357832s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/10/21 - 07:25:32 | 200 | 232.0791ms | 127.0.0.1 | POST "/api/chat" time=2024-10-21T07:26:01.913+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model="D:\AI\ollama models\blobs\sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe" gpu=0 parallel=4 available=8131477504 required="6.2 GiB" time=2024-10-21T07:26:01.913+08:00 level=INFO source=server.go:108 msg="system memory" total="31.9 GiB" free="15.2 GiB" free_swap="14.0 GiB" time=2024-10-21T07:26:01.914+08:00 level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB" time=2024-10-21T07:26:01.921+08:00 level=INFO source=server.go:399 msg="starting llama server" cmd="C:\Users\CP\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe --model D:\AI\ollama models\blobs\sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 8699" time=2024-10-21T07:26:01.924+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-10-21T07:26:01.924+08:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding" time=2024-10-21T07:26:01.924+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error" INFO [wmain] starting c++ runner | tid="21172" timestamp=1729466761 INFO [wmain] build info | build=3670 commit="88c682cf" tid="21172" timestamp=1729466761 INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="21172" timestamp=1729466761 total_threads=16 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="8699" tid="21172" timestamp=1729466761 llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from D:\AI\ollama models\blobs\sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 llama_model_loader: - kv 5: general.size_label str = 8B llama_model_loader: - kv 6: general.license str = llama3.1 llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 9: llama.block_count u32 = 32 llama_model_loader: - kv 10: llama.context_length u32 = 131072 llama_model_loader: - kv 11: llama.embedding_length u32 = 4096 llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 13: llama.attention.head_count u32 = 32 llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 17: general.file_type u32 = 2 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 66 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-10-21T07:26:02.184+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6600, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.27 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: ROCm0 buffer size = 4156.00 MiB llm_load_tensors: CPU buffer size = 281.81 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 2.02 MiB llama_new_context_with_model: ROCm0 compute buffer size = 560.00 MiB llama_new_context_with_model: ROCm_Host compute 
buffer size = 24.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="21172" timestamp=1729466769 time=2024-10-21T07:26:09.466+08:00 level=INFO source=server.go:637 msg="llama runner started in 7.54 seconds" [GIN] 2024/10/21 - 07:26:41 | 200 | 40.8337865s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/10/21 - 07:26:47 | 200 | 2.1513ms | 127.0.0.1 | GET "/api/tags" time=2024-10-21T07:27:03.697+08:00 level=INFO source=server.go:108 msg="system memory" total="31.9 GiB" free="15.4 GiB" free_swap="14.1 GiB" time=2024-10-21T07:27:03.697+08:00 level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=256 layers.model=29 layers.offload=29 layers.split="" memory.available="[7.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.1 GiB" memory.required.partial="5.1 GiB" memory.required.kv="112.0 MiB" memory.required.allocations="[5.1 GiB]" memory.weights.total="3.8 GiB" memory.weights.repeating="3.3 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="304.0 MiB" memory.graph.partial="730.4 MiB" time=2024-10-21T07:27:03.705+08:00 level=INFO source=server.go:399 msg="starting llama server" cmd="C:\Users\CP\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe --model D:\AI\ollama models\blobs\sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 256 --parallel 1 --port 8729" time=2024-10-21T07:27:03.707+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-10-21T07:27:03.707+08:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding" time=2024-10-21T07:27:03.707+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error" INFO [wmain] starting c++ runner | tid="10064" timestamp=1729466823 INFO [wmain] build info | build=3670 commit="88c682cf" tid="10064" timestamp=1729466823 INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="10064" timestamp=1729466823 total_threads=16 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="8729" tid="10064" timestamp=1729466823 llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from D:\AI\ollama models\blobs\sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen2.5 7B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Qwen2.5 llama_model_loader: - kv 5: general.size_label str = 7B llama_model_loader: - kv 6: general.license str = apache-2.0 llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-7... 
llama_model_loader: - kv 8: general.base_model.count u32 = 1 llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 7B llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-7B llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"] llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"] llama_model_loader: - kv 14: qwen2.block_count u32 = 28 llama_model_loader: - kv 15: qwen2.context_length u32 = 32768 llama_model_loader: - kv 16: qwen2.embedding_length u32 = 3584 llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 18944 llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 28 llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 4 llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 22: general.file_type u32 = 15 llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 33: general.quantization_version u32 = 2 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_K: 169 tensors llama_model_loader: - type q6_K: 29 tensors llm_load_vocab: special tokens cache size = 22 time=2024-10-21T07:27:03.962+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: token to piece cache size = 0.9310 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 152064 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 3584 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_head = 28 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 7 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18944 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 7.62 B llm_load_print_meta: model size = 4.36 GiB (4.91 BPW) llm_load_print_meta: general.name = Qwen2.5 7B Instruct llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6600, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: ROCm0 buffer size = 4168.09 MiB llm_load_tensors: CPU buffer size = 292.36 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 112.00 MiB llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.59 MiB llama_new_context_with_model: ROCm0 compute buffer size = 
304.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 11.01 MiB llama_new_context_with_model: graph nodes = 986 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="10064" timestamp=1729466830 time=2024-10-21T07:27:10.974+08:00 level=INFO source=server.go:637 msg="llama runner started in 7.27 seconds" [GIN] 2024/10/21 - 07:27:26 | 200 | 1.1177ms | 127.0.0.1 | HEAD "/" [GIN] 2024/10/21 - 07:27:26 | 200 | 0s | 127.0.0.1 | GET "/api/ps"

likelovewant commented 1 month ago

time=2024-10-21T07:25:09.262+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]"

Since you are on that build, you can ask directly on the ollama issue tracker at https://github.com/ollama/ollama/issues; the official build (https://github.com/ollama/ollama/blob/main/llm/generate/gen_windows.ps1) only supports "gfx1030", "gfx1100", "gfx1101", "gfx1102". Judging from the log, llm_load_tensors: offloading 32 repeating layers to GPU and llm_load_tensors: offloading 28 repeating layers to GPU, nothing is actually wrong there. You can try this build instead, https://github.com/likelovewant/ollama-for-amd, then replace the rocm files and test again. Building it yourself is also an option.
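To double-check which build is actually being picked up, the installed binary's version and the runner list in the server log can be inspected (the server.log path below is the usual Windows location and is an assumption for this setup):

```powershell
# Check which ollama build is installed and which runners the server loaded.
# The server.log path is the usual Windows location; adjust if yours differs.
& "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" --version
Select-String -Pattern "Dynamic LLM libraries" -Path "$env:LOCALAPPDATA\Ollama\server.log"
```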

wsadaaa commented 1 month ago

I am already using the build from https://github.com/likelovewant/ollama-for-amd; previously I had always upgraded in place from older versions. After this problem appeared, I also downloaded and installed the official OllamaSetup.exe, then overlaid it with ollama-windows-amd64-rocm6.1.2.7z from https://github.com/likelovewant/ollama-for-amd, and finally replaced the rocm with rocm.gfx1032.for.hip.sdk.6.1.2. The behavior is still the same. If I do not use your ollama and the rocm replacement, it can only run 100% on the CPU.
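For reference, the final replacement step amounts to roughly the following; the destination folder is only an assumption inferred from the runner path in the log, so confirm where rocblas.dll actually lives in your install before copying:

```powershell
# Rough sketch of replacing the ROCm libs; the target folder is an assumption
# based on the runner path in the log (...\lib\ollama\runners\rocm_v6.1).
$rocmDir = "$env:LOCALAPPDATA\Programs\Ollama\lib\ollama\runners\rocm_v6.1"
Get-ChildItem $rocmDir -Filter rocblas.dll               # confirm the real location first
Copy-Item .\rocblas.dll $rocmDir -Force                  # gfx1032 build from the ROCmLibs package
Remove-Item "$rocmDir\rocblas\library" -Recurse -Force   # remove the stock library folder
Copy-Item .\library "$rocmDir\rocblas\library" -Recurse  # gfx1032 library from the package
```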

likelovewant commented 1 month ago

You can download https://github.com/likelovewant/ollama-for-amd/releases/download/v0.3.13/OllamaSetup.exe directly. I am not sure whether that is the cause, but your log shows everything is normal. In my local tests, 8 GB of VRAM is enough to run a qwen 7b model; apart from the cuda_v11 / cuda_v12 entries looking different, everything else matches. You can also try shutting down every running ollama process completely, then starting it manually: run ./ollama from the ollama install directory, open another terminal window, and run the model with ./ollama run qwen2 to see what happens. Beyond that, you could try updating the GPU driver.
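Spelled out as commands (assuming `ollama serve` is the intended way to start the server in the foreground for the "./ollama" step), the manual test looks like this:

```powershell
# Terminal 1: stop any background instance, then run the server in the foreground.
Get-Process -Name "ollama*" -ErrorAction SilentlyContinue | Stop-Process -Force
Set-Location "$env:LOCALAPPDATA\Programs\Ollama"
.\ollama.exe serve

# Terminal 2: load the model and chat while watching VRAM in Task Manager.
& "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" run qwen2
```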

wsadaaa commented 1 month ago

You can download https://github.com/likelovewant/ollama-for-amd/releases/download/v0.3.13/OllamaSetup.exe directly. I am not sure whether that is the cause, but your log shows everything is normal. In my local tests, 8 GB of VRAM is enough to run a qwen 7b model; apart from the cuda_v11 / cuda_v12 entries looking different, everything else matches. You can also try shutting down every running ollama process completely, then starting it manually: run ./ollama from the ollama install directory, open another terminal window, and run the model with ./ollama run qwen2 to see what happens. Beyond that, you could try updating the GPU driver.

I have tried all of that and it still does not work. For now I have rolled back to 0.3.6 with rocm 5.7.7, and VRAM usage is normal again. That suggests the problem comes from rocm 6.1.2.

likelovewant commented 1 month ago

ollama-windows-amd64-rocm-5.7.7z: v0.3.11 has a rocm 5.7 build.

wsadaaa commented 1 month ago

ollama-windows-amd64-rocm-5.7.7z: v0.3.11 has a rocm 5.7 build.

OK, thank you. I hope the problem described above with rocm 6.1.2 can be fixed, since I would like to use llama3.2. Thanks!

likelovewant commented 1 month ago

"error looking up nvidia GPU memory" error="cuda driver library failed to get device context 801" time=2024-10-21T07:25:10.018+08:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 driver=6.2 name="AMD Radeon RX 6600" total="8.0 GiB" available="7.8 GiB"

You can try updating to v0.3.14 and test this rocm lib: rocm.gfx1032.for.hip.sdk.6.1.2.optimized.Fremont.Dango.Version.7z. Your case is an isolated one, so you could also try updating the GPU driver, or check whether there are other settings in your system environment variables that might interfere.
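A quick way to audit the environment for such settings is to list every variable that ollama or the ROCm runtime might read (the name filter below is just a heuristic, not an exhaustive list):

```powershell
# List environment variables that could influence GPU detection or offload
# (the name filter is only a heuristic; review anything that looks relevant).
Get-ChildItem Env: |
    Where-Object { $_.Name -match 'OLLAMA|HSA|HIP|ROCR|ROCM|GPU' } |
    Format-Table Name, Value -AutoSize
```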

wsadaaa commented 1 month ago

I have tried 0.3.14 and the Fremont.Dango.Version.7z; the problem remains :(

likelovewant commented 1 month ago

Then the problem is not rocm itself; it is more likely something in your system settings or the driver. Alternatives: 1. Try the zluda approach with the official build: replace the relevant files under cuda_v11 and verify, see https://github.com/ollama/ollama/issues/4464. 2. Follow the instructions at https://github.com/likelovewant/ollama-for-amd/wiki and build a ROCm 5.7 ollama yourself (a rough sketch follows below).
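For option 2, the build roughly boils down to the following sketch. It assumes Go, a C/C++ toolchain and the ROCm/HIP SDK are already installed; whether gen_windows.ps1 honors an AMDGPU_TARGETS variable or needs to be edited directly to add gfx1032 should be checked against the wiki:

```powershell
# Sketch of a local ROCm build per the ollama-for-amd wiki. Assumes Go, a C/C++
# toolchain and the ROCm/HIP SDK are installed; AMDGPU_TARGETS is an assumed
# knob - the wiki may instead ask you to edit gen_windows.ps1 to add gfx1032.
git clone https://github.com/likelovewant/ollama-for-amd.git
Set-Location ollama-for-amd
$env:AMDGPU_TARGETS = "gfx1032"   # RX 6600
go generate ./...                 # runs llm/generate/gen_windows.ps1
go build .
```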

wsadaaa commented 1 month ago

After repeated attempts, it turned out the likely cause was the newer GPU driver. I uninstalled the GPU driver and reinstalled HIP 6.1.2, including the GPU driver bundled with it. Then I did not overwrite the rocm under the official HIP installation with the compiled gfx1032 rocm; I only overwrote the one inside ollama, and the abnormal VRAM usage problem was solved. Thank you so much!!!