intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Ollama on Windows not working #11270

Open jaymeanchante opened 3 months ago

jaymeanchante commented 3 months ago

Hello, first of all thanks for the amazing project.

I was able to run ipex-llm with llama.cpp; everything worked fine, and inference was very fast on both CPU and GPU. However, it didn't work with Ollama.

Device: Samsung Book 3 360
OS: Windows 11
GPU: Iris Xe
GPU Driver: 31.0.101.5534

I followed the step-by-step instructions:

conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
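(A small optional sanity check, not from the original guide: confirm the package actually landed in the active environment before continuing.)

```
# Should report the installed ipex-llm version inside the llm-cpp env
pip show ipex-llm
```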

mkdir ipex-ollama
cd ipex-ollama

With administrator privileges:

init-ollama.bat

it successfully creates the symlinks
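(An optional check, not part of the original steps: PowerShell can list the symlinks that init-ollama.bat created in the current folder.)

```
# Lists the symbolic links created by init-ollama.bat
Get-ChildItem | Where-Object { $_.LinkType -eq 'SymbolicLink' } | Select-Object Name, LinkType
```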

set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
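(One detail worth double-checking: `set NAME=value` is cmd.exe syntax. If the variables are set in a PowerShell tab, as the next step suggests, the equivalent assignments would be the sketch below.)

```
# PowerShell equivalents of the cmd.exe `set` commands above
$env:OLLAMA_NUM_GPU = "999"
$env:no_proxy = "localhost,127.0.0.1"
$env:ZES_ENABLE_SYSMAN = "1"
$env:SYCL_CACHE_PERSISTENT = "1"
```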

In one PowerShell tab I run:

.\ollama.exe serve

In another tab, the following commands run successfully:

.\ollama.exe -v
.\ollama.exe help
.\ollama.exe pull phi3

I visited http://localhost:11434/ and it says "Ollama is running"
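(Another optional check, assuming the standard Ollama REST API: listing the pulled models confirms the server answers more than just the root page.)

```
# GET /api/tags lists locally available models
curl.exe http://localhost:11434/api/tags
```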

However, if I run

.\ollama.exe run phi3

I get the following:

Error: llama runner process has terminated: exit status 0xc0000135

In the server log I see:

[GIN] 2024/06/09 - 20:46:01 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/06/09 - 20:46:01 | 200 | 505.3µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/06/09 - 20:46:01 | 200 | 563.5µs | 127.0.0.1 | POST "/api/show"
time=2024-06-09T20:46:02.141+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="3.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-09T20:46:02.147+02:00 level=INFO source=server.go:342 msg="starting llama server" cmd="C:\Users\jayme\repos\ipex-ollama\dist\windows-amd64\ollama_runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\jayme\.ollama\models\blobs\sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 53957"
time=2024-06-09T20:46:02.150+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-09T20:46:02.151+02:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-09T20:46:02.151+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-06-09T20:46:02.409+02:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000135 "
[GIN] 2024/06/09 - 20:46:02 | 500 | 491.9745ms | 127.0.0.1 | POST "/api/chat"
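(For context: exit status 0xc0000135 is the Windows STATUS_DLL_NOT_FOUND code, i.e. the runner process started but could not locate a DLL it depends on. One rough, optional check, using the runner path from the log above, is to look at what actually sits next to the runner executable.)

```
# Directory taken from the "starting llama server" line in the log above
Get-ChildItem "C:\Users\jayme\repos\ipex-ollama\dist\windows-amd64\ollama_runners\cpu_avx2" -Filter *.dll
```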

sgwhat commented 3 months ago

Hi @jaymeanchante, we have reproduced your issue and are working on resolving it; we will inform you when we make progress.

sgwhat commented 3 months ago

Hi @jaymeanchante, I can now run Ollama successfully on Windows with Intel Iris Xe (GPU driver 5534). The reason I was able to reproduce your issue earlier is that the GPU driver was not installed correctly. You may verify your environment and run Ollama with the steps below:

  1. Run ls-sycl-device.exe to check your SYCL devices; the expected output is shown below (providing your output would help me address this issue).
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32|  7445M|            1.3.29283|
| 1|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32|  7445M|        31.0.101.5534|
| 2|     [opencl:cpu:0]|11th Gen Intel Core i7-1185G7 @ 3.00GHz|    3.0|      8|    8192|   64|      -|                    -|
  2. If your output is different, please check the GPU driver installation and reinstall it. Based on my tests, both versions 5522 and 5534 of the GPU driver support Ollama running on Windows 11 with Intel Iris Xe.

Note: Please ensure that you are running ollama serve in your llm-cpp conda environment.
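(Putting the note together with the earlier steps, a minimal launch sequence from a fresh PowerShell window might look like the sketch below; it assumes conda has been initialized for PowerShell and that the four environment variables are set as shown earlier.)

```
# Activate the environment where ipex-llm[cpp] was installed
conda activate llm-cpp
# Move into the folder where init-ollama.bat created the symlinks
cd ipex-ollama
# Environment variables (OLLAMA_NUM_GPU, no_proxy, ZES_ENABLE_SYSMAN,
# SYCL_CACHE_PERSISTENT) are assumed to be set as in the original steps
.\ollama.exe serve
```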

opticblu commented 3 months ago

> Hi @jaymeanchante, I can now run Ollama successfully on Windows with Intel Iris Xe (GPU driver 5534). The reason I was able to reproduce your issue earlier is that the GPU driver was not installed correctly. You may verify your environment and run Ollama with the steps below:
>
>   1. Run ls-sycl-device.exe to check your SYCL devices; the expected output is shown below (providing your output would help me address this issue).
ls-sycl-device.exe
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16704M|            1.3.29516|
| 1|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16704M|        31.0.101.5590|

That's my output; it still runs on the CPU no matter what I do.

sgwhat commented 3 months ago

> That's my output; it still runs on the CPU no matter what I do.

Hi @opticblu,

  1. Could you please provide the detailed logs returned by the ollama server?
  2. Could you please run the environment check script from https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and reply with the output?
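(A sketch of how those scripts are typically used on Windows; the script name env-check.bat is an assumption and should be confirmed against that directory.)

```
git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm\python\llm\scripts
# env-check.bat is assumed to be the Windows variant of the check script;
# verify the exact name in the scripts directory linked above
.\env-check.bat
```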