intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Brave Leo AI using Ollama and Intel GPU #12248

Open NikosDi opened 1 month ago

NikosDi commented 1 month ago

Hello. I'm trying to use Brave Leo AI with Ollama using an Intel GPU.

The instructions from Brave using local LLMs via Ollama are here: https://brave.com/blog/byom-nightly/

The instructions from Intel using Ollama with Intel GPU are here: https://www.intel.com/content/www/us/en/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html

How could I combine those?

I want to use Brave Leo AI (not Open WebUI) running on Intel GPU via Ollama.

My system: Windows 11/ Intel ARC A380

Thank you.

user7z commented 4 weeks ago

@NikosDi you should do one thing at a time. First make sure that "ollama run <model>" runs successfully on your GPU. If it does, then it's easy to use it with Brave: run "ollama serve" in a terminal, go to Brave, add a local model, and for the endpoint just copy the link they give you in the instructions. Then open Leo, select your model, and that's it. To automate this, i.e. to have "ollama serve" start automatically on boot, you need a Windows service; I don't know if it's possible to write one, but on Linux I did it and it works fantastically. I also recommend using Page Assist for configuring some basic Ollama settings. Have a nice day.
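
For illustration, a minimal sketch of that sequence (shell syntax; the model name and endpoint here are only examples):

# terminal 1: start the ipex-llm build of ollama
ollama serve
# terminal 2: confirm a small model actually runs on the Intel GPU
ollama run llama3.2:1b
# then, in Brave's Leo settings, add a local model with the same model name and
# point its server endpoint at http://localhost:11434/v1/chat/completions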

NikosDi commented 4 weeks ago

@user7z thanks for your reply.

I could be doing something wrong, but before I run "ollama serve" I use this script from cmd every time:

python -m venv llm_env
C:\Windows\System32\llm_env\Scripts\activate.bat
init-ollama.bat
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve

For model name I use "llama3.1" and the path of server endpoint is "http://localhost:11434/v1/chat/completions"

Everything is offloaded to GPU and I get this error:

Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Exception caught at file:C:/Users/Administrator/actions-runner/release-cpp-oneapi_2024_2/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-sycl.cpp, line:5009, func:operator()
SYCL error: CHECK_TRY_ERROR((stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_clear at C:/Users/Administrator/actions-runner/release-cpp-oneapi_2024_2/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-sycl.cpp:5009
C:\Users\Administrator\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-internal\llm\llama.cpp\ggml\src\ggml-sycl\common.hpp:103: SYCL error
time=2024-10-24T11:43:23.800+03:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server not responding"
time=2024-10-24T11:43:25.529+03:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
time=2024-10-24T11:43:25.783+03:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: error:CHECK_TRY_ERROR((stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Meet error in this line code!\r\n in function ggml_backend_sycl_buffer_clear at C:/Users/Administrator/actions-runner/release-cpp-oneapi_2024_2/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-sycl.cpp:5009\r\nC:\Users\Administrator\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-internal\llm\llama.cpp\ggml\src\ggml-sycl\common.hpp:103: SYCL error"
[GIN] 2024/10/24 - 11:43:25 | 500 | 11.2772703s | 127.0.0.1 | POST "/v1/chat/completions"

TIA

user7z commented 4 weeks ago

@NikosDi, either you've been misguided or there is a Windows bug. Read the guide carefully, step by step, and see if you did something wrong. Ollama should use llm-cpp as a backend, which uses ipex-llm. Please forget about Brave for now, and don't run the script: open the terminal yourself, serve, and try to chat from another terminal. That way you can debug the problem more precisely.

sgwhat commented 4 weeks ago

Hi @NikosDi, could you please provide the full logs returned on the ollama server side? You may also follow the install windows gpu document and the install ollama document to prepare your environment.

NikosDi commented 4 weeks ago

Hello @user7z, @sgwhat

As I wrote above, I have followed a different guide, based on the PDF, which doesn't install or use a conda environment.

It's a PDF from Intel. https://www.intel.com/content/www/us/en/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html

My exact system specifications are:

Windows 11 24H2 (26100.2033) - Intel ARC A380 - Drivers v6079.

The full log file of Ollama server is this:

Running the command prompt as Nikos (administrator): Command-line.txt

Running the command prompt as Administrator: admin cmd.txt

TIA

user7z commented 4 weeks ago

@NikosDi you should install Visual Studio 2022 and select the Desktop development with C++ workload, as the guide suggests. Then do ollama serve, open another terminal and execute: ollama run llama3.2:1b, and see if it runs. Make sure you followed the guide that @sgwhat mentioned; the PDF is outdated, so always check the GitHub guides because they get updated often.

NikosDi commented 4 weeks ago

@user7z

What GPU are you running Leo AI on?

All the guides refer to the A770; mine is an A380. Perhaps it's a hardware incompatibility after all.

sgwhat commented 4 weeks ago

@NikosDi I think ipex-llm ollama supports the A380; you may follow our GitHub guides to try it.

NikosDi commented 4 weeks ago

@sgwhat I think I found out the problem, it is mentioned in your guide. https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md

Number 14 says:

  14. Native API failed error
  On latest version of ipex-llm, you might come across native API failed error with certain models without the -c parameter. Simply adding -c xx would resolve this problem.

Where should I add this -c parameter, and what is -c xx?

sgwhat commented 3 weeks ago

Hi @NikosDi, -c xx is the size of the prompt context, but I believe ollama does not need to consider -c xx, because it is a parameter used by llama.cpp.
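
For reference, a minimal sketch of where -c goes when calling llama.cpp directly (the binary name and model path are placeholders and depend on your build):

# -c sets the prompt context size in tokens
./llama-cli -m models/llama-3.1-8b-Q4_K_M.gguf -c 2048 -p "Hello"
# in ollama, the closest equivalent is the num_ctx option (e.g. PARAMETER num_ctx 2048 in a Modelfile)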

Could you please provide the detailed log returned by ollama serve and the script you used to run ollama?

NikosDi commented 3 weeks ago

Hello @sgwhat

If you have already checked the text files above and they didn't cover what you need, please give me some instructions on how to provide the detailed log returned by ollama serve. Where can I find it? What is the name of the file?

In those texts above I have included all the commands I use to run ollama and the response of my Windows environment.

But I can provide them again as a script, along with the detailed log if you assist with instructions.

Thank you.

sgwhat commented 3 weeks ago

Hi @NikosDi, I have checked your runtime log. Please follow only our GitHub guide to run ipex-llm ollama, and you may run it again without executing "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" before running ollama serve.

NikosDi commented 3 weeks ago

Hello @sgwhat.

I have already installed Ollama from the official page https://ollama.com/download on two of my PCs (1 Windows 11, 1 Linux Ubuntu 24.04.1) and Brave Leo AI works like a charm on both of them using the CPU. It has extreme SIMD CPU optimizations, but CPUs are slow for such tasks.

Ollama installer has built-in support for NVIDIA CUDA and AMD ROCm but no Intel GPU support.

For the Intel GPU I followed the guide from June 2024; I don't know if it's already obsolete.

I have already installed Intel's AI Playground on my Win 11 system, the original Ollama CPU setup, and the above Python environment for the Intel GPU (Intel oneAPI, Python).

The truth is that it would be a lot more convenient if there were a single installer, like AI Playground or similar, to avoid the manual installation steps for the Ollama Intel GPU setup, since it's not built in.

I'm not in the mood to follow another guide and install more environments (miniforge, conda, etc.).

If troubleshooting is impossible following Intel's guide from June 2024, then I have to stop here, because we are going in circles.

Maybe it would be useful if you could tell me what I asked before, regarding the Native API failed error.

I'm using Llama3.1 as the model and I would like to try adding -c xx and see the results, if you could tell me where to add this parameter.

Thank you.

sgwhat commented 3 weeks ago

Hi @NikosDi, I have tested running ipex-llm ollama on a Windows 11 Arc A380 laptop and it works fine. Please follow the guide I mentioned before, and set OLLAMA_NUM_PARALLEL=1 before running ollama serve.

Also please note: do not call C:\Program Files (x86)\Intel\oneAPI\setvars.bat.

NikosDi commented 3 weeks ago

@sgwhat Oh man! This is insane!

It was just this "set OLLAMA_NUM_PARALLEL=1" parameter I had to add to the script. Everything works like a charm now.

The utilization of the A380 is more than 90%, sometimes even 99%, and the response speed is unbelievable compared to my Core i7 9700, many times faster.

Also, calling setvars.bat is mandatory, otherwise I get multiple missing DLL errors like the one I posted here, so I call it every time, unless you can find another way. (Screenshot 2024-10-30 121242)
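
For reference, the sequence that works for me now is the earlier script with the one added line (a sketch, not a polished installer):

python -m venv llm_env
C:\Windows\System32\llm_env\Scripts\activate.bat
init-ollama.bat
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
set OLLAMA_NUM_PARALLEL=1
ollama serve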

Thank you very much for the effort.

NikosDi commented 3 weeks ago

@sgwhat Well...

Unfortunately, even using the exact same model (Llama 3.1 - 8B) for CPU and Intel GPU on the exact same page, the results are completely different using Leo AI to summarize the page.

The A380 is ~9x faster than the Core i7 9700, but the results are sometimes almost garbage.

It hallucinates a lot and always gives me very short summaries, compared to the CPU version, which is perfect.

Extremely slow but perfect, in both size and accuracy.

I don't know if there is anything I could change in the parameters to improve the quality of the Intel GPU results, even at the cost of speed.

Thank you.

sgwhat commented 3 weeks ago

Hi @NikosDi , based on my tests, I haven't observed any noticeable difference in answer quality when running ipex-llm ollama on GPU or CPU, nor have I encountered issues with poor output on the A380. Could you provide the detailed responses from Leo AI?

NikosDi commented 3 weeks ago

Hello @sgwhat I think we are comparing different things.

My comparison is between the default Ollama installation, which has built-in support for CUDA and ROCm and therefore falls back to the CPU on Intel GPU hardware, and the Intel GPU using IPEX-LLM.

It's default Ollama (CPU) vs IPEX (Intel GPU)

Your comparison is IPEX (CPU) vs IPEX (Intel GPU)

I'm interested in testing your comparison too; how could I run IPEX using my CPU?

I'll post all my results here with all configurations.

Thank you.

sgwhat commented 2 weeks ago

I'm interested in testing your comparison too; how could I run IPEX using my CPU?

You may set OLLAMA_NUM_GPU=0 before starting ollama serve. Also, I do not see any answer-quality issues when running llama3 with ipex-llm ollama on either CPU or GPU.
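
For illustration, the only change from the GPU setup is the GPU count (Windows cmd syntax, as in the script earlier in this thread):

rem force CPU-only inference in ipex-llm ollama
set OLLAMA_NUM_GPU=0
ollama serve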

NikosDi commented 2 weeks ago

Hello @sgwhat. I did various tests during the weekend with different models, using Brave Leo AI's built-in cloud options, Ollama CPU, IPEX-LLM Intel GPU, and now, following your suggestion above, one more: IPEX-LLM Intel CPU.

So, using my Windows 11 PC and Intel ARC A380, I downloaded four different LLMs, or we should call them SLMs (Small Language Models):

Qwen2.5:7B, Mistral, Llama3.1:8B, Gemma2:9B

Using CPU mode I can obviously run all of them (I have 32GB of RAM). Using GPU mode I can run all of them besides Gemma2, as it needs more than 6GB of VRAM.

Regarding speed the results are extremely obvious.

IPEX CPU is definitely not on par with native Ollama CPU. IPEX CPU is V E R Y slow; Ollama CPU, on the other hand, is extremely well optimized for AVX2.

I mean the difference is huge: Ollama CPU is ~7 times faster than IPEX CPU running on the Intel Core i7 9700.

The Intel GPU, in turn, is ~9 times faster than Ollama CPU, so the differences are huge and clear.

Regarding quality, I cannot say for sure. Using the same model, Ollama CPU, IPEX CPU and IPEX GPU give more or less similar results. I couldn't reproduce the hallucinations, short answers and garbage output using my (small) benchmark page.

As a personal preference, I probably prefer Mistral as my favorite model, running on the Intel GPU of course.

Two questions:

1) In order to run Mistral on the Intel GPU I had to set the parameter OLLAMA_NUM_PARALLEL > 1 (it works flawlessly for 2, 3 and even 4). But if I set OLLAMA_NUM_PARALLEL > 1, then llama3.1 and Qwen2.5 do not work!

In order to run those models, I have to set OLLAMA_NUM_PARALLEL=1 again, strictly.

Is it possible to change the script in a way that supports Mistral and the other models with the same script?

2) I installed Ubuntu 24.04.1 LTS on the above machine (Core i7 9700, ARC A380) and I would like to test IPEX-LLM on this Linux OS. Your guide and the PDF I mentioned above have instructions for Ubuntu 22.04; your guide in particular is very strict about that.

I don't want to degrade my Kernel or install older Ubuntu version.

Are there newer instructions for Ubuntu 24.04 regarding the IPEX-LLM installation?

Thank you.

sgwhat commented 2 weeks ago

Hi @NikosDi,

  1. Setting OLLAMA_NUM_PARALLEL=1 helps reduce memory usage, which is currently the only effective way to run Llama and Qwen on the A380. Regarding your question, we will conduct some research and follow up with a response.
  2. We do not have instructions for Ubuntu 24.04. Please follow our official documentation to install it.

NikosDi commented 2 weeks ago

@sgwhat One last thing to add.

Maybe you could add to your research the possibility of embedding IPEX-LLM inside the official Ollama installer, just like NVIDIA CUDA and AMD ROCm.

Thank you

sgwhat commented 1 week ago

Hi @NikosDi.

Regarding your previous question 1, we tested and found that with OLLAMA_NUM_PARALLEL=1, ipex-llm ollama can run Mistral correctly on the Arc A380, which means OLLAMA_NUM_PARALLEL=1 is compatible with multiple models. We are also curious why you had to change OLLAMA_NUM_PARALLEL to be greater than 1.

NikosDi commented 1 week ago

@sgwhat You are right. I can't reproduce the error, but it was the same as at the beginning with llama3.1, when I didn't have OLLAMA_NUM_PARALLEL=1 at all: the same complaint about the native API failed error.

So, I found out the issue.

I was changing models from inside the Leo settings to test them and I was getting the error, but not only for Mistral; it was just a coincidence that I noticed the issue with Mistral.

If you load a model with the "ollama serve" script, you have to stick with that model on the Intel GPU with IPEX-LLM until it unloads from VRAM. Then, after it unloads, you can change the model for Leo AI without errors, using the same Ollama session you are already running with OLLAMA_NUM_PARALLEL=1, for all models including Mistral.

If you don't want to wait for the default 5 minutes, you can close and rerun the "ollama serve" script, or instantly unload the loaded model with this command:

curl http://localhost:11434/api/generate -d "{\"model\": \"$selected_model\", \"keep_alive\": 0}"

where the selected model is the one shown by the "ollama ps" command (you need a \ before each inner " — GitHub removes them automatically in this post).
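
For example, assuming the currently loaded model is mistral (substitute whatever "ollama ps" reports); on a Linux shell the quoting is simpler:

curl http://localhost:11434/api/generate -d '{"model": "mistral", "keep_alive": 0}'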

When I was trying the CPU models using IPEX-LLM and the "ollama serve" script with OLLAMA_NUM_GPU=0, I didn't have such issues. All models were loaded in RAM at the same time and I could change the current model from the Leo settings.

I don't know if the problem exists due to limited VRAM vs RAM (6GB vs 32GB), or if it's a limitation of IPEX-LLM (GPU) vs IPEX-LLM (CPU).

sgwhat commented 1 week ago

@NikosDi, it's because the A380 only has 6GB of VRAM available.

NikosDi commented 4 days ago

@user7z @sgwhat

Windows: I managed to turn the batch file (.bat) into a Windows service, and now Ollama IPEX-LLM runs automatically as a service.

Ubuntu 24.04: I also managed to create two different scripts (using conda and python directly) under Ubuntu 24.04, which both work flawlessly.

But I'm really struggling to make these scripts run from inside another script, in order to turn that script into a service (so it runs automatically).

So this question is more Linux-related than Ollama- or IPEX-LLM-related.

How do I make these scripts (either one or both) run from inside a script? (They both run directly from the terminal.)

1) Python

python3.11 -m venv llm_env
source llm_env/bin/activate
init-ollama
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1
./ollama serve

2) Conda

conda activate llm-cpp
init-ollama
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1
./ollama serve

user7z commented 4 days ago

@NikosDi I understand what you're trying to do; I spent days on this reading systemd man pages. I will go to my laptop where I have the service and post it here.

user7z commented 4 days ago

@NikosDi ~/.config/systemd/user/conda-ipex.service

#########################
[Unit]
Description=Ollama Service
Wants=network-online.target
After=network.target network-online.target

[Service]
ExecStart=/usr/bin/bash -c 'source /opt/intel/oneapi/setvars.sh && /home/$user/ollama serve'
WorkingDirectory=/var/lib/ollama
Environment="HOME=/var/lib/ollama"
Environment="OLLAMA_MODELS=/var/lib/ollama"
Environment="OLLAMA_NUM_GPU=999"
Environment="no_proxy=localhost,127.0.0.1"
Environment="BIGDL_LLM_XMX_DISABLED=1"
Environment="ZES_ENABLE_SYSMAN=1"
Environment="SYCL_CACHE_PERSISTENT=1"
Environment="ONEAPI_DEVICE_SELECTOR=level_zero:0"
Restart=on-failure
RestartSec=3
RestartPreventExitStatus=1
Type=oneshot

[Install]
WantedBy=default.target
########################

systemctl --user daemon-reload
systemctl --user enable --now conda-ipex.service

You might want to change some environment variables or delete some; edit it according to your needs. I like to store my models in /var/lib/ollama so they can be used with the distribution-packaged ollama, but that causes file-permission issues and would probably need some permission adjustments, so you probably don't need the commented env variables.
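
If it helps, a couple of standard systemd commands for checking on the user service (nothing ipex-llm specific):

# check current status of the unit above
systemctl --user status conda-ipex.service
# follow its logs
journalctl --user -u conda-ipex.service -f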

NikosDi commented 4 days ago

After many hours of "chatting" with ChatGPT (the free version, without a subscription, using the GPT-4o mini model) it finally works, using the conda environment:

Shell Script:

#!/bin/bash

# Activate conda environment
source /home/nikos/miniforge3/etc/profile.d/conda.sh  # Update this with the correct Conda path
conda activate llm-cpp

# Ensure init-ollama is in the PATH (adjust as needed)
export PATH="/home/nikos/llm_env/bin:$PATH"

# Initialize Ollama
init-ollama

# Set environment variables
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_NUM_PARALLEL=1

# Start Ollama
./ollama serve

Service script:

[Unit]
Description=Ollama using IPEX-LLM for Intel GPUs
After=network.target

[Service]
EnvironmentFile=/home/nikos/Documents/ollama_env
User=nikos
Group=nikos
WorkingDirectory=/home/nikos
ExecStart=/bin/bash -i -c 'source /home/nikos/miniforge3/etc/profile.d/conda.sh && conda activate llm-cpp && /home/nikos/Documents/Script_Intel_GPU_Linux.sh'
ExecStartPre=/bin/bash -i -c 'source /home/nikos/miniforge3/etc/profile.d/conda.sh && conda activate llm-cpp'
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

EnvironmentFile called by service script:

HOME=/home/nikos
CONDA_BASE=/home/nikos/miniforge3

# Ensure that Conda is in the PATH
PATH=$CONDA_BASE/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH 

# Set library paths
LD_LIBRARY_PATH=$CONDA_BASE/lib:$LD_LIBRARY_PATH
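
For completeness, enabling it as a system service looks something like this (assuming the unit file above is saved as /etc/systemd/system/ollama-ipex.service; the file name is just an example):

# "ollama-ipex.service" is a placeholder name; use whatever you saved the unit file as
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-ipex.service
sudo systemctl status ollama-ipex.service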

@user7z Thank you for your answer and time!

user7z commented 4 days ago

@NikosDi you welcome , for security reasons it better to make it a user service

NikosDi commented 4 days ago

@user7z The next step is to find a nice chatbot GUI that can be used as a client/server app.

That's because I want to use other devices on the same LAN which have no dGPU inside (only extremely slow iGPUs) and still get Ollama hardware acceleration.

Have you tried using Brave Leo AI over the LAN (using the PC with the dGPU as a server)?

Have you found any nice GUI app using Ollama for the same client/server use over the LAN?
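
For what it's worth, what I have in mind on the server side is something like this (OLLAMA_HOST is a standard Ollama setting; the firewall would need to allow port 11434):

# bind the server to all interfaces instead of localhost only
export OLLAMA_HOST=0.0.0.0
./ollama serve
# clients on the LAN would then point at http://<server-ip>:11434/v1/chat/completions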

user7z commented 3 days ago

@NikosDi All I did was for local use, just for my laptop, but here is a nice portable GUI I used before; it's named open-webui.

user7z commented 3 days ago

@NikosDi If that is what you're looking for and you use podman instead of docker, I have a nice config if you need it.