ollama / ollama

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Support Radeon RX 5700 XT (gfx1010) #2503

Open scabros opened 5 months ago

scabros commented 5 months ago

Hi! Congrats for the great project!

We were trying to test ollama with AMD GPU support and struggled a bit, because the install guides do not make it clear that the CUDA libraries are required for ollama (or llama.cpp) to work properly, even with team red GPUs.

The error when running ollama run llama2 was (leaving here for reference):

...
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: ROCm_Host input buffer size = 13.01 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 164.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_op_flatten at /home/devel/ollama/llm/llama.cpp/ggml-cuda.cu:9208
  hipGetLastError()
loading library /tmp/ollama3700311510/rocm_v6/libext_server.so
GGML_ASSERT: /home/devel/ollama/llm/llama.cpp/ggml-cuda.cu:241: !"CUDA error"
[New LWP 4411]
[New LWP 4412]
[New LWP 4413]
[New LWP 4414]
[New LWP 4415]
...

After we installed the CUDA libraries as per the instructions HERE, the problem went away.

We also faced problems with ROCm 6.0.2 support for different GPU models (in our case an RX 5700 XT, arch gfx1010): the current binary packages don't contain the TensileLibrary.dat (which somehow "maps" the kernel objects to use with different GPUs) for this architecture.

We had this error:

time=2024-02-09T17:05:38.481Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama3752973675/rocm_v6/libext_server.so"
time=2024-02-09T17:05:38.481Z level=INFO source=dyn_ext_server.go:145 msg="Initializing llama server"
rocBLAS error: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010
free(): invalid pointer
SIGABRT: abort
PC=0x7f4bb13739fc m=3 sigcode=18446744073709551610

So I downloaded the full ROCm source and tried to build it again, just to get the right command for compiling TensileLibrary.dat:

This is the command that CMake uses:

'/home/devel/rocBLAS/build/virtualenv/lib/python3.10/site-packages/Tensile/bin/TensileCreateLibrary' '--merge-files' '--separate-architectures' '--lazy-library-loading' '--no-short-file-names' '--library-print-debug' '--code-object-version=default' '--cxx-compiler=hipcc' '--jobs=14' '--library-format=msgpack' '--architecture=gfx1012' '/home/devel/rocBLAS/library/src/blas3/Tensile/Logic/asm_full' '/home/devel/rocBLAS/build/Tensile' 'HIP'

This is the command I used to generate a new TensileLibrary.dat:

'/home/devel/rocBLAS/build/virtualenv/lib/python3.10/site-packages/Tensile/bin/TensileCreateLibrary' '--merge-files' '--no-short-file-names' '--library-print-debug' '--code-object-version=default' '--cxx-compiler=hipcc' '--jobs=14' '--library-format=msgpack' '/home/devel/rocBLAS/library/src/blas3/Tensile/Logic/asm_full' '/home/devel/rocBLAS/build/Tensile' 'HIP'

(I removed '--separate-architectures' and '--lazy-library-loading', as per the instructions in this bug.)
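
As a possible follow-up (I haven't verified this; it's just a guess based on rocBLAS's ROCBLAS_TENSILE_LIBPATH environment variable), you could point rocBLAS at the rebuilt library directory instead of overwriting the system copy:

# Assumption: rocBLAS will pick up the rebuilt Tensile library via ROCBLAS_TENSILE_LIBPATH;
# the 'library' subdirectory under the TensileCreateLibrary output dir is also an assumption.
export ROCBLAS_TENSILE_LIBPATH=/home/devel/rocBLAS/build/Tensile/library
ollama run llama2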

Hope this helps others! Thanks again!

dhiltgen commented 4 months ago

Unfortunately, the official ROCm builds from AMD don't currently support the RX 5700 XT.

With the new release 0.1.29, we'll now detect this incompatibility, gracefully fall back to CPU mode, and log some information in the server log about what happened. There is a facility to override this manually with HSA_OVERRIDE_GFX_VERSION; however, I'm not sure which supported gfx target will work for this GPU.
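
For reference, the override is just an environment variable set when starting the server; a minimal sketch (the value here only shows the form, it is not a target we've verified for the RX 5700 XT):

# Sketch only: force a specific gfx target for ROCm before starting the server.
# 10.3.0 (i.e. gfx1030) is an unverified guess for gfx1010 cards.
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve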

AMD is working on updates to a future version of ROCm v6 that will support GPU families, so there's hope this will be added in the future. There's also a possibility to build the ROCm tensor library to add specific targets; however, I haven't investigated the details on that yet.

jpmcb commented 4 months ago

It may be a good idea to keep an eye on ROCm/Tensile now that gfx1010 support has merged: https://github.com/ROCm/Tensile/pull/1897. From my understanding, once they cut a new release, the support should flow into the main ROCm/ROCm build.

So, depending on how ollama detects GPU support, it may just start working.


I also ran into this and would love support for the slightly older Radeon RX 5600:

❯ sudo ollama serve

...

time=2024-03-15T14:15:29.915-06:00 level=INFO source=gpu.go:77 msg="Detecting GPU type"
time=2024-03-15T14:15:29.915-06:00 level=INFO source=gpu.go:191 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-15T14:15:29.927-06:00 level=INFO source=gpu.go:237 msg="Discovered GPU libraries: []"
time=2024-03-15T14:15:29.927-06:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-15T14:15:29.927-06:00 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-03-15T14:15:29.928-06:00 level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx1010]"
time=2024-03-15T14:15:29.943-06:00 level=WARN source=amd_linux.go:114 msg="amdgpu [0] gfx1010 is not supported by /tmp/ollama402405963/rocm [gfx1030 gfx1100 gfx1101 gfx1102 gfx900 gfx906 gfx908 gfx90a gfx940 gfx941 gfx942]"
time=2024-03-15T14:15:29.943-06:00 level=WARN source=amd_linux.go:116 msg="See https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md for HSA_OVERRIDE_GFX_VERSION usage"
time=2024-03-15T14:15:29.943-06:00 level=INFO source=amd_linux.go:127 msg="all detected amdgpus are skipped, falling back to CPU"
time=2024-03-15T14:15:29.943-06:00 level=INFO source=routes.go:1133 msg="no GPU detected"

But thanks for building in detection of AMD GPU families that aren't supported: it saved me lots of useless debugging time 👀 Happy to help validate any patches: feel free to @ me.

puccaso commented 4 months ago

Having this same issue, running openSUSE Tumbleweed. Because AMD only supports enterprise SUSE, I can't actually get the ROCm drivers installed completely; I'd have to change to Ubuntu. I have thought about trying distrobox, but the requirements show that a kernel module has to be installed and DKMS is involved, so I don't want to break my current system testing this. I will eventually find a way, though. Love this project indeed!

aionik-me commented 3 months ago

The gfx1010 should work, but you'll need to manually override what's allowed and, in some cases, map it to the closest supported type. I suggest you try the NixOS flake of ollama here: https://github.com/abysssol/ollama-flake

The only modifications that you may need for your "unsupported" GPU on NixOS are the overrides mentioned above.
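
A minimal sketch of what that override looks like as an environment variable on the server process (the gfx1030 / 10.3.0 value is an assumption about the closest supported target; match these to your closest GPU capabilities):

# Assumption: map gfx1010 to the closest supported target (gfx1030 -> 10.3.0).
# Match these to your closest GPU capabilities.
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve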

netdata is a good way to monitor remotely, dig into specific periods of time, and isolate when the GPU is being used, how the load is balanced, etc.

Please let me know if you have questions or feedback.

skidunion commented 2 months ago

> Match these to your closest GPU capabilities

@aionik-me could you provide some more insight on what that means? I followed the previous steps on Windows, but I just get gibberish when I try to prompt the model with anything. With CPU compute it works fine.


2024/05/25 16:52:11 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\\Users\\<user>\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_TMPDIR:]"
time=2024-05-25T16:52:11.474+03:00 level=INFO source=images.go:704 msg="total blobs: 20"
time=2024-05-25T16:52:11.475+03:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-25T16:52:11.476+03:00 level=INFO source=routes.go:1054 msg="Listening on 192.168.0.100:11434 (version 0.1.38)"
time=2024-05-25T16:52:11.476+03:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [rocm_v5.7 cpu cpu_avx cpu_avx2 cuda_v11.3]"
time=2024-05-25T16:52:11.510+03:00 level=INFO source=amd_windows.go:63 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=10.3.0
time=2024-05-25T16:52:11.861+03:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=rocm compute=gfx1010:xnack- driver=0.0 name="AMD Radeon RX 5700 XT" total="8.0 GiB" available="7.9 GiB"
time=2024-05-25T16:52:18.161+03:00 level=INFO source=amd_windows.go:63 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=10.3.0
time=2024-05-25T16:52:20.404+03:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="7.9 GiB" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="512.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-05-25T16:52:20.404+03:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="7.9 GiB" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="512.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-05-25T16:52:20.409+03:00 level=INFO source=server.go:320 msg="starting llama server" cmd="C:\\Users\\<user>\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\<user>\\.ollama\\models\\blobs\\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 4096 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 49782"
time=2024-05-25T16:52:20.412+03:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-25T16:52:20.412+03:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-25T16:52:20.412+03:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=2770 commit="952d03d" tid="9456" timestamp=1716645140
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="9456" timestamp=1716645140 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="49782" tid="9456" timestamp=1716645140
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\<user>\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-05-25T16:52:20.663+03:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'

rocBLAS warning: No paths matched C:\Users\<user>\AppData\Local\Programs\Ollama\rocm\\rocblas\library\*gfx1010*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  4155.99 MiB
llm_load_tensors:        CPU buffer size =   281.81 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="9456" timestamp=1716645143
time=2024-05-25T16:52:23.676+03:00 level=INFO source=server.go:545 msg="llama runner started in 3.26 seconds"
[GIN] 2024/05/25 - 16:52:31 | 200 |   13.7334707s |   192.168.0.106 | POST     "/api/chat"

Zippy-boy commented 1 month ago

I've been trying to do the same with my RX 5700. I noticed this when trying to serve:

time=2024-06-08T13:06:06.875Z level=WARN source=amd_linux.go:296 msg="amdgpu is not supported" gpu=0 gpu_type=gfx1010 library=/opt/rocm/lib supported_types="[gfx1030 gfx1100 gfx1101 gfx1102 gfx900 gfx906 gfx908 gfx90a gfx940 gfx941 gfx942]"

I'm pretty sure that gfx1010 is supported by ROCm now, so is there an amdgpu supported_types list that I can add gfx1010 to, and then see if it works? I'm not very knowledgeable in this, so this may be stupid :/

b0o commented 1 month ago

I was trying to run HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve and getting Error: llama runner process has terminated: signal: aborted (core dumped) error:Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010.

I saw a lot of suggestions to use HSA_OVERRIDE_GFX_VERSION="10.3.0" but that caused my GPU to crash.

However, I tried symlinking TensileLibrary_lazy_gfx1010.dat to TensileLibrary_lazy_gfx1030.dat: sudo ln -s /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx{1030,1010}.dat, and now it seems to be working!

Running HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve and then running a model like ollama run mistral successfully loads and runs the model on my GPU.
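
Putting the whole workaround in one place, this is the sequence I'm running (paths are as they appear on my system; adjust for yours):

# Symlink the gfx1030 Tensile library so rocBLAS finds one for gfx1010
sudo ln -s /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx{1030,1010}.dat
# Start the server with the gfx override...
HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve
# ...and, in another terminal, run a model
ollama run mistral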

thednp commented 1 month ago

Any news on this?

kdta91 commented 3 weeks ago

> I was trying to run HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve and getting Error: llama runner process has terminated: signal: aborted (core dumped) error:Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010.
>
> I saw a lot of suggestions to use HSA_OVERRIDE_GFX_VERSION="10.3.0" but that caused my GPU to crash.
>
> However, I tried symlinking TensileLibrary_lazy_gfx1010.dat to TensileLibrary_lazy_gfx1030.dat: sudo ln -s /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx{1030,1010}.dat, and now it seems to be working!
>
> Running HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve and then running a model like ollama run mistral successfully loads and runs the model on my GPU.

This works for me in my case, using an RX 5600 XT; however, the VRAM usage sits at a constant >= 5 GB.

murlakatamenka commented 1 week ago

> I was trying to run HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve and getting Error: llama runner process has terminated: signal: aborted (core dumped) error:Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010.
>
> I saw a lot of suggestions to use HSA_OVERRIDE_GFX_VERSION="10.3.0" but that caused my GPU to crash.
>
> However, I tried symlinking TensileLibrary_lazy_gfx1010.dat to TensileLibrary_lazy_gfx1030.dat: sudo ln -s /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx{1030,1010}.dat, and now it seems to be working!
>
> Running HSA_OVERRIDE_GFX_VERSION="10.1.0" ollama serve and then running a model like ollama run mistral successfully loads and runs the model on my GPU.

Do you run Arch with the ollama-rocm package?

b0o commented 1 week ago

Yes, I'm on Arch, and I'm using ollama-rocm-git, but IIRC ollama-rocm worked as well.

Mr-Ples commented 1 week ago

Someone has it working here: https://github.com/ollama/ollama/issues/2453#issuecomment-2236193832