likelovewant / ollama-for-amd

Get up and running with Llama 3, Mistral, Gemma, and other large language models, with added support for more AMD GPUs.
https://ollama.com
MIT License

Error when running a model: llama runner process has terminated: exit status 0xc0000139 #8

Closed — yourchanges closed this issue 3 months ago

yourchanges commented 3 months ago

What is the issue?

Environment:

I downloaded the release 0.2.5 build of Ollama directly; the GPU is an RX 570 (gfx803) on Windows 10 64-bit.

Running `ollama run qwen2:1.5b` or `ollama run phi3` fails with an error. Do I need to recompile it myself, or is my environment missing some dependency?

Logs

2024/07/19 14:10:24 routes.go:965: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\\Users\\Administrator\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-19T14:10:24.927+08:00 level=INFO source=images.go:760 msg="total blobs: 0"
time=2024-07-19T14:10:24.927+08:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
time=2024-07-19T14:10:24.927+08:00 level=INFO source=routes.go:1012 msg="Listening on 127.0.0.1:11434 (version 0.2.5-0-gc7e2f88)"
time=2024-07-19T14:10:24.927+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 rocm_v5.7]"
time=2024-07-19T14:10:24.927+08:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-19T14:10:26.428+08:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx803 driver=0.3 name="Radeon RX 570" total="4.0 GiB" available="3.9 GiB"
[GIN] 2024/07/19 - 14:12:30 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:12:30 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2024/07/19 - 14:14:56 | 200 |        24.4µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:14:56 | 404 |       569.4µs |       127.0.0.1 | POST     "/api/show"
time=2024-07-19T14:14:59.855+08:00 level=INFO source=download.go:136 msg="downloading 405b56374e02 in 10 100 MB part(s)"
time=2024-07-19T14:15:13.857+08:00 level=INFO source=download.go:251 msg="405b56374e02 part 4 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-07-19T14:15:14.858+08:00 level=INFO source=download.go:251 msg="405b56374e02 part 2 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-07-19T14:15:20.856+08:00 level=INFO source=download.go:251 msg="405b56374e02 part 3 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-07-19T14:17:10.503+08:00 level=INFO source=download.go:136 msg="downloading 62fbfd9ed093 in 1 182 B part(s)"
time=2024-07-19T14:17:13.649+08:00 level=INFO source=download.go:136 msg="downloading c156170b718e in 1 11 KB part(s)"
time=2024-07-19T14:17:17.137+08:00 level=INFO source=download.go:136 msg="downloading f02dd72bb242 in 1 59 B part(s)"
time=2024-07-19T14:17:20.397+08:00 level=INFO source=download.go:136 msg="downloading c9f5e9ffbc5f in 1 485 B part(s)"
[GIN] 2024/07/19 - 14:17:24 | 200 |         2m28s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/07/19 - 14:17:24 | 200 |     16.9994ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-19T14:17:25.383+08:00 level=INFO source=sched.go:179 msg="one or more GPUs detected that are unable to accurately report free memory - disabling default concurrency"
time=2024-07-19T14:17:25.401+08:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e gpu=0 parallel=4 available=4160749568 required="1.9 GiB"
time=2024-07-19T14:17:25.401+08:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[3.9 GiB]" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="927.4 MiB" memory.weights.repeating="744.8 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="482.3 MiB"
time=2024-07-19T14:17:25.407+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 4 --port 50734"
time=2024-07-19T14:17:25.427+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-19T14:17:25.427+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-19T14:17:25.427+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
time=2024-07-19T14:17:25.691+08:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000139 "
[GIN] 2024/07/19 - 14:17:25 | 500 |    879.2946ms |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 14:18:07 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:18:07 | 200 |     16.8505ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-19T14:18:08.137+08:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e gpu=0 parallel=4 available=4160749568 required="1.9 GiB"
time=2024-07-19T14:18:08.137+08:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[3.9 GiB]" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="927.4 MiB" memory.weights.repeating="744.8 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="482.3 MiB"
time=2024-07-19T14:18:08.142+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 4 --port 50737"
time=2024-07-19T14:18:08.149+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-19T14:18:08.149+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-19T14:18:08.150+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
time=2024-07-19T14:18:08.411+08:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000139 "
[GIN] 2024/07/19 - 14:18:08 | 500 |    842.9484ms |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 14:18:24 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:18:24 | 200 |         552µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/07/19 - 14:19:29 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:19:29 | 404 |            0s |       127.0.0.1 | POST     "/api/show"

OS

Windows

GPU

AMD

CPU

Intel

Ollama version

0.2.5

yourchanges commented 3 months ago

v0.2.7 gives the same error; something is probably still missing.

C:\Users\Administrator\AppData\Local\Programs\Ollama>ollama.exe serve
2024/07/19 14:42:38 routes.go:1096: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\\Users\\Administrator\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-19T14:42:38.048+08:00 level=INFO source=images.go:778 msg="total blobs: 10"
time=2024-07-19T14:42:38.048+08:00 level=INFO source=images.go:785 msg="total unused blobs removed: 0"
time=2024-07-19T14:42:38.049+08:00 level=INFO source=routes.go:1143 msg="Listening on 127.0.0.1:11434 (version 0.2.7-1-g8f30d89)"
time=2024-07-19T14:42:38.050+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 rocm_v5.7]"
time=2024-07-19T14:42:38.051+08:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-19T14:42:38.640+08:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx803 driver=0.3 name="Radeon RX 570" total="4.0 GiB" available="3.9 GiB"
[GIN] 2024/07/19 - 14:43:12 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:43:12 | 200 |       573.3µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2024/07/19 - 14:43:14 | 200 |       511.1µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:43:14 | 200 |      1.6895ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/07/19 - 14:43:25 | 200 |        49.2µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 14:43:25 | 200 |      4.8159ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-19T14:43:26.128+08:00 level=INFO source=sched.go:179 msg="one or more GPUs detected that are unable to accurately report free memory - disabling default concurrency"
time=2024-07-19T14:43:26.133+08:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=0 parallel=1 available=4160749568 required="3.4 GiB"
time=2024-07-19T14:43:26.134+08:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[3.9 GiB]" memory.required.full="3.4 GiB" memory.required.partial="3.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[3.4 GiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.6 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-07-19T14:43:26.144+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 50903"
time=2024-07-19T14:43:26.153+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-19T14:43:26.153+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-19T14:43:26.154+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
time=2024-07-19T14:43:26.409+08:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000139 "
[GIN] 2024/07/19 - 14:43:26 | 500 |    826.6527ms |       127.0.0.1 | POST     "/api/chat"
yourchanges commented 3 months ago

Ran the model manually — screenshot: 截图_6eabf346-e26e-4eca-94e3-37b69d50f544

yourchanges commented 3 months ago

Everything was installed according to the documentation. Here is the hipInfo output:

C:\Program Files\AMD\ROCm\5.7\bin>hipInfo.exe

--------------------------------------------------------------------------------
device#                           0
Name:                             Radeon RX 570
pciBusID:                         1
pciDeviceID:                      0
pciDomainID:                      0
multiProcessorCount:              32
maxThreadsPerMultiProcessor:      2048
isMultiGpuBoard:                  0
clockRate:                        1244 Mhz
memoryClockRate:                  1750 Mhz
memoryBusWidth:                   0
totalGlobalMem:                   4.00 GB
totalConstMem:                    4026531840
sharedMemPerBlock:                64.00 KB
canMapHostMemory:                 1
regsPerBlock:                     0
warpSize:                         64
l2CacheSize:                      2097152
computeMode:                      0
maxThreadsPerBlock:               1024
maxThreadsDim.x:                  1024
maxThreadsDim.y:                  1024
maxThreadsDim.z:                  1024
maxGridSize.x:                    2147483647
maxGridSize.y:                    2147483647
maxGridSize.z:                    2147483647
major:                            8
minor:                            0
concurrentKernels:                1
cooperativeLaunch:                0
cooperativeMultiDeviceLaunch:     0
isIntegrated:                     0
maxTexture1D:                     16384
maxTexture2D.width:               16384
maxTexture2D.height:              16384
maxTexture3D.width:               2048
maxTexture3D.height:              2048
maxTexture3D.depth:               2048
isLargeBar:                       0
asicRevision:                     0
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate:             1000.00 Mhz
arch.hasGlobalInt32Atomics:       1
arch.hasGlobalFloatAtomicExch:    1
arch.hasSharedInt32Atomics:       1
arch.hasSharedFloatAtomicExch:    1
arch.hasFloatAtomicAdd:           1
arch.hasGlobalInt64Atomics:       1
arch.hasSharedInt64Atomics:       1
arch.hasDoubles:                  1
arch.hasWarpVote:                 1
arch.hasWarpBallot:               1
arch.hasWarpShuffle:              1
arch.hasFunnelShift:              0
arch.hasThreadFenceSystem:        1
arch.hasSyncThreadsExt:           0
arch.hasSurfaceFuncs:             0
arch.has3dGrid:                   1
arch.hasDynamicParallelism:       0
gcnArchName:                      gfx803
peers:
non-peers:                        device#0

memInfo.total:                    4.00 GB
memInfo.free:                     3.88 GB (97%)
yourchanges commented 3 months ago

It is confirmed that the issue was with the driver and the graphics card's BIOS. My old card is a Lenovo/MSI OEM RX 570. In the end, I flashed it with the reference RX 580 BIOS and it worked. I can now run 3B models, but after chatting for a while the server process exits on its own due to hipErrorOutOfMemory.


D:\ollama_windows-amd64>ollama serve
2024/07/19 16:57:05 routes.go:965: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\\Users\\Administrator\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:D:\\ollama_windows-amd64\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-19T16:57:05.791+08:00 level=INFO source=images.go:760 msg="total blobs: 10"
time=2024-07-19T16:57:05.794+08:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
time=2024-07-19T16:57:05.795+08:00 level=INFO source=routes.go:1012 msg="Listening on 127.0.0.1:11434 (version 0.2.5-0-gc7e2f88)"
time=2024-07-19T16:57:05.809+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 rocm_v5.7]"
time=2024-07-19T16:57:05.810+08:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-19T16:57:06.902+08:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx803 driver=5.2 name="Radeon RX 570 Series" total="4.0 GiB" available="3.9 GiB"
[GIN] 2024/07/19 - 16:57:34 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 16:57:34 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2024/07/19 - 16:57:38 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 16:57:38 | 200 |      2.7351ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/07/19 - 16:57:45 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 16:57:45 | 200 |      9.0987ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-19T16:57:45.907+08:00 level=INFO source=sched.go:179 msg="one or more GPUs detected that are unable to accurately report free memory - disabling default concurrency"
time=2024-07-19T16:57:45.913+08:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=0 parallel=1 available=4160749568 required="3.4 GiB"
time=2024-07-19T16:57:45.914+08:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[3.9 GiB]" memory.required.full="3.4 GiB" memory.required.partial="3.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[3.4 GiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.6 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-07-19T16:57:45.923+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="D:\\ollama_windows-amd64\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 61879"
time=2024-07-19T16:57:46.004+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-19T16:57:46.004+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-19T16:57:46.007+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="2348" timestamp=1721379466
INFO [wmain] system info | n_threads=3 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="2348" timestamp=1721379466 total_threads=6
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="5" port="61879" tid="2348" timestamp=1721379466
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 323
llm_load_vocab: token to piece cache size = 0.1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
time=2024-07-19T16:57:46.531+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 570 Series, compute capability 8.0, VMM: no
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  2021.84 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.13 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:        CPU  output buffer size =     0.13 MiB
ggml_cuda_host_malloc: failed to allocate 10.01 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:      ROCm0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="2348" timestamp=1721379474
time=2024-07-19T16:57:54.283+08:00 level=INFO source=server.go:617 msg="llama runner started in 8.28 seconds"
[GIN] 2024/07/19 - 16:57:54 | 200 |    8.8032265s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:58:14 | 200 |     295.323ms |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:58:27 | 200 |    2.3140154s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:58:35 | 200 |    2.0796562s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:59:17 | 200 |    2.1118663s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:59:29 | 200 |     4.498826s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:59:39 | 200 |    4.8661064s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 16:59:54 | 200 |    2.9535855s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 17:00:05 | 200 |    2.9333785s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/19 - 17:00:26 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/19 - 17:00:26 | 200 |      6.0009ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-19T17:00:26.526+08:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=0 parallel=1 available=4160749568 required="3.4 GiB"
time=2024-07-19T17:00:26.527+08:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[3.9 GiB]" memory.required.full="3.4 GiB" memory.required.partial="3.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[3.4 GiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.6 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-07-19T17:00:26.532+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="D:\\ollama_windows-amd64\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 61942"
time=2024-07-19T17:00:26.540+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-19T17:00:26.540+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-19T17:00:26.541+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="11908" timestamp=1721379626
INFO [wmain] system info | n_threads=3 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="11908" timestamp=1721379626 total_threads=6
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="5" port="61942" tid="11908" timestamp=1721379626
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 323
llm_load_vocab: token to piece cache size = 0.1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
time=2024-07-19T17:00:26.805+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 570 Series, compute capability 8.0, VMM: no
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  2021.84 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.13 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:        CPU  output buffer size =     0.13 MiB
ggml_cuda_host_malloc: failed to allocate 10.01 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:      ROCm0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
time=2024-07-19T17:00:28.598+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
time=2024-07-19T17:00:28.849+08:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000005 "
[GIN] 2024/07/19 - 17:00:28 | 500 |    2.7604258s |       127.0.0.1 | POST     "/api/chat"
yourchanges commented 3 months ago

`ollama run qwen2:1.5b` always reports:

ggml_cuda_host_malloc: failed to allocate 2.34 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model: CPU output buffer size = 2.34 MiB
ggml_cuda_host_malloc: failed to allocate 19.01 MiB of pinned memory: hipErrorOutOfMemory

time=2024-07-19T17:21:53.373+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="D:\\ollama_windows-amd64\\ollama_runners\\rocm_v5.7\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 4 --port 49772"
time=2024-07-19T17:21:53.478+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-19T17:21:53.478+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-19T17:21:53.480+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="4980" timestamp=1721380913
INFO [wmain] system info | n_threads=3 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="4980" timestamp=1721380914 total_threads=6
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="6" port="49772" tid="4980" timestamp=1721380914
llama_model_loader: loaded meta data with 21 key-value pairs and 338 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen2-1.5B-Instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-19T17:21:54.250+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.54 B
llm_load_print_meta: model size       = 885.97 MiB (4.81 BPW)
llm_load_print_meta: general.name     = Qwen2-1.5B-Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 570 Series, compute capability 8.0, VMM: no
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   885.97 MiB
llm_load_tensors:        CPU buffer size =   182.57 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
ggml_cuda_host_malloc: failed to allocate 2.34 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:        CPU  output buffer size =     2.34 MiB
ggml_cuda_host_malloc: failed to allocate 19.01 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:      ROCm0 compute buffer size =   299.75 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    19.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
time=2024-07-19T17:21:58.803+08:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000005 "
[GIN] 2024/07/19 - 17:21:58 | 500 |    5.8925205s |       127.0.0.1 | POST     "/api/chat"
likelovewant commented 3 months ago

time=2024-07-19T14:42:38.640+08:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx803 driver=0.3 name="Radeon RX 570" total="4.0 GiB" available="3.9 GiB" — based on this message, you have low VRAM; try a smaller model. Otherwise, there is a workaround: run the server separately. Start Ollama from the program folder with `./ollama serve`, then open a new terminal in the same place and run `./ollama run qwen2:1.5b` (for example). You may also try the new libs rocm.gfx803.optic.test.version.7z and test whether they bring any improvement. @yourchanges

yourchanges commented 3 months ago

@likelovewant thank you very much. I have tested the new libs; they run more stably, without VRAM out-of-memory errors, but with qwen2:1.5b the model sometimes needs more time to answer the same questions.

But for low VRAM, I think it's great!!! And an RX 580 with 8 GB VRAM costs only about 350 RMB, so it will work even better at a very low cost.

likelovewant commented 3 months ago

@likelovewant thank you very much. I have tested the new libs; they run more stably, without VRAM out-of-memory errors, but with qwen2:1.5b the model sometimes needs more time to answer the same questions.

But for low VRAM, I think it's great!!! And an RX 580 with 8 GB VRAM costs only about 350 RMB, so it will work even better at a very low cost.

@yourchanges Thanks for your feedback. That's great news for gfx803 users!

Qwen2 had some known issues on that architecture before, but the ollama issues section provides some helpful solutions. For example:

OLLAMA_FLASH_ATTENTION=True ollama serve
ollama run qwen2:7b-instruct-q8_0

You might also want to build your own libraries; building them based on this tutorial could help: wiki.

It involves testing different Vega logic versions, so it's not strictly necessary unless you want to explore further optimization. (The gfx803 optic test version is based on Vega 10 logic; you may test the others if you want to try.)

I understand and can read your Chinese! I replied in English to make the information accessible to others who might have the same question. Let me know if you have any more questions or need further assistance. 😊

yourchanges commented 3 months ago

@likelovewant thank you for your quick reply. I'll try `OLLAMA_FLASH_ATTENTION=True ollama serve` later on the AMD Radeon VII; it's more powerful, so I can run things more efficiently.

Lofanmi commented 3 months ago

@likelovewant thank you very much. I have tested the new libs; they run more stably, without VRAM out-of-memory errors, but with qwen2:1.5b the model sometimes needs more time to answer the same questions. But for low VRAM, I think it's great!!! And an RX 580 with 8 GB VRAM costs only about 350 RMB, so it will work even better at a very low cost.

@yourchanges Thanks for your feedback. That's great news for gfx803 users!

Qwen2 had some known issues on that architecture before, but the ollama issues section provides some helpful solutions. For example:

* **Start `ollama serve` with `OLLAMA_FLASH_ATTENTION=True`:**
OLLAMA_FLASH_ATTENTION=True ollama serve
* **Then run your model:**
ollama run qwen2:7b-instruct-q8_0

You might also want to build your own libraries; building them based on this tutorial could help: wiki.

It involves testing different Vega logic versions, so it's not strictly necessary unless you want to explore further optimization. (The gfx803 optic test version is based on Vega 10 logic; you may test the others if you want to try.)

I understand and can read your Chinese! I replied in English to make the information accessible to others who might have the same question. Let me know if you have any more questions or need further assistance. 😊

I tried qwen2, but after chatting a few more times it still outputs GGGGG... Looking at the code, it seems this environment variable doesn't take effect?

Because the server's startup log shows the --flash-attn parameter was not appended as expected:

    if flashAttnEnabled {
        params = append(params, "--flash-attn")
    }

It looks like this code is the cause? Upstream Ollama's AMD support may not be quite there yet...

    for _, g := range gpus {
        // only cuda (compute capability 7+) and metal support flash attention
        if g.Library != "metal" && (g.Library != "cuda" || g.DriverMajor < 7) {
            flashAttnEnabled = false
        }

        // mmap has issues with partial offloading on metal
        if g.Library == "metal" &&
            uint64(opts.NumGPU) > 0 &&
            uint64(opts.NumGPU) < ggml.KV().BlockCount()+1 {
            opts.UseMMap = new(bool)
            *opts.UseMMap = false
        }
    }
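
To see the effect of that condition in isolation, here is a minimal, self-contained Go sketch that mirrors the quoted check (the `gpuInfo` struct and the sample devices are made up for illustration; they are not ollama's real types):

    package main

    import "fmt"

    // gpuInfo is a stand-in for ollama's GPU info type; only the two
    // fields the quoted condition reads are reproduced here.
    type gpuInfo struct {
        Library     string
        DriverMajor int
    }

    func main() {
        // Assume flash attention was requested via OLLAMA_FLASH_ATTENTION.
        flashAttnEnabled := true

        // Hypothetical devices: a gfx803 card under ROCm and a CUDA card.
        gpus := []gpuInfo{
            {Library: "rocm", DriverMajor: 5},
            {Library: "cuda", DriverMajor: 8},
        }

        for _, g := range gpus {
            // Same check as the quoted ollama code: only Metal and
            // CUDA (compute capability 7+) keep flash attention on,
            // so the presence of any ROCm device switches it back off.
            if g.Library != "metal" && (g.Library != "cuda" || g.DriverMajor < 7) {
                flashAttnEnabled = false
            }
        }

        fmt.Println("flashAttnEnabled:", flashAttnEnabled) // false because of the rocm device
    }

This matches the startup log: on a ROCm-only gfx803 machine, --flash-attn is never appended to the runner command even when OLLAMA_FLASH_ATTENTION=True is set.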
likelovewant commented 3 months ago

@likelovewant thank you very much. I have tested the new libs; they run more stably, without VRAM out-of-memory errors, but with qwen2:1.5b the model sometimes needs more time to answer the same questions. But for low VRAM, I think it's great!!! And an RX 580 with 8 GB VRAM costs only about 350 RMB, so it will work even better at a very low cost.

@yourchanges Thanks for your feedback. That's great news for gfx803 users! Qwen2 had some known issues on that architecture before, but the ollama issues section provides some helpful solutions. For example:

* **Start `ollama serve` with `OLLAMA_FLASH_ATTENTION=True`:**
OLLAMA_FLASH_ATTENTION=True ollama serve
* **Then run your model:**
ollama run qwen2:7b-instruct-q8_0

You might also want to build your own libraries; building them based on this tutorial could help: wiki. It involves testing different Vega logic versions, so it's not strictly necessary unless you want to explore further optimization. (The gfx803 optic test version is based on Vega 10 logic; you may test the others if you want to try.) I understand and can read your Chinese! I replied in English to make the information accessible to others who might have the same question. Let me know if you have any more questions or need further assistance. 😊

I tried qwen2, but after chatting a few more times it still outputs GGGGG... Looking at the code, it seems this environment variable doesn't take effect?

Because the server's startup log shows the --flash-attn parameter was not appended as expected:

  if flashAttnEnabled {
      params = append(params, "--flash-attn")
  }

It looks like this code is the cause? Upstream Ollama's AMD support may not be quite there yet...

  for _, g := range gpus {
      // only cuda (compute capability 7+) and metal support flash attention
      if g.Library != "metal" && (g.Library != "cuda" || g.DriverMajor < 7) {
          flashAttnEnabled = false
      }

      // mmap has issues with partial offloading on metal
      if g.Library == "metal" &&
          uint64(opts.NumGPU) > 0 &&
          uint64(opts.NumGPU) < ggml.KV().BlockCount()+1 {
          opts.UseMMap = new(bool)
          *opts.UseMMap = false
      }
  }

"I made an interesting discovery about Ollama's behavior.

likelovewant commented 3 months ago

The GGGGG output is fixed by the newly uploaded v0.3.0.