intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Brave Leo AI using Ollama and Intel GPU #12248

NikosDi opened this issue 1 week ago

NikosDi commented 1 week ago

Hello. I'm trying to use Brave Leo AI with Ollama using an Intel GPU.

The instructions from Brave using local LLMs via Ollama are here: https://brave.com/blog/byom-nightly/

The instructions from Intel using Ollama with Intel GPU are here: https://www.intel.com/content/www/us/en/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html

How could I combine those?

I want to use Brave Leo AI (not Open WebUI) running on Intel GPU via Ollama.

My system: Windows 11/ Intel ARC A380

Thank you.

user7z commented 1 week ago

@NikosDi you should do one thing at a time. First make sure that ollama run #model runs successfully on your GPU. If it works, then it's easy to use it with Brave: you just need to run ollama serve in a terminal, go to Brave, and add a local model; for Ollama, just copy the description link they give you in the bracket, then use Leo and select your model, and that's it. To automate this, i.e. to have ollama serve start automatically on boot, you need a Windows service; I don't know if it's possible to write one. On Linux I did it, and it works fantastically. I recommend using Page Assist for configuring some basic Ollama stuff. Have a nice day.
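Roughly, the flow looks like this (just a sketch; the model name is only an example, and the endpoint is the OpenAI-compatible one that Brave's BYOM page asks for):

    :: Terminal 1: start the ipex-llm build of Ollama
    ollama serve

    :: Terminal 2: confirm the model actually runs on the GPU first
    ollama run llama3.2:1b

    :: Then, in Brave's Leo settings, add a local model and point the
    :: server endpoint at Ollama's OpenAI-compatible API:
    ::   http://localhost:11434/v1/chat/completions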

NikosDi commented 1 week ago

@user7z thanks for your reply.

I could be doing something wrong, but before I run "ollama serve" I use this script from cmd every time:

    python -m venv llm_env
    C:\Windows\System32\llm_env\Scripts\activate.bat
    init-ollama.bat
    "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
    set OLLAMA_NUM_GPU=999
    set no_proxy=localhost,127.0.0.1
    set ZES_ENABLE_SYSMAN=1
    set SYCL_CACHE_PERSISTENT=1
    ollama serve

For the model name I use "llama3.1" and for the server endpoint I use "http://localhost:11434/v1/chat/completions".
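(To sanity-check that endpoint outside Brave, it can also be hit directly; just a sketch, assuming the server is up and the model has already been pulled:)

    curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"llama3.1\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"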

Everything is offloaded to GPU and I get this error:

Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Exception caught at file:C:/Users/Administrator/actions-runner/release-cpp-oneapi_2024_2/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-sycl.cpp, line:5009, func:operator()
SYCL error: CHECK_TRY_ERROR((stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_clear at C:/Users/Administrator/actions-runner/release-cpp-oneapi_2024_2/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-sycl.cpp:5009
C:\Users\Administrator\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-internal\llm\llama.cpp\ggml\src\ggml-sycl\common.hpp:103: SYCL error
time=2024-10-24T11:43:23.800+03:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server not responding"
time=2024-10-24T11:43:25.529+03:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
time=2024-10-24T11:43:25.783+03:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: error:CHECK_TRY_ERROR((stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Meet error in this line code!\r\n in function ggml_backend_sycl_buffer_clear at C:/Users/Administrator/actions-runner/release-cpp-oneapi_2024_2/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-sycl.cpp:5009\r\nC:\Users\Administrator\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-internal\llm\llama.cpp\ggml\src\ggml-sycl\common.hpp:103: SYCL error"
[GIN] 2024/10/24 - 11:43:25 | 500 | 11.2772703s | 127.0.0.1 | POST "/v1/chat/completions"

TIA

user7z commented 1 week ago

@NikosDi either you've been misguided or there is a Windows bug. Read here carefully, step by step, and see if you did something wrong. Ollama should use llm-cpp as a backend, which uses ipex-llm. Please forget about Brave right now, and don't run the script; open the terminal yourself, run ollama serve, and try to chat from another terminal. This way you can debug the problem more precisely.

sgwhat commented 1 week ago

Hi @NikosDi, could you please provide the full logs returned on the ollama server side? You may also follow this install windows gpu document and this install ollama document to prepare your environment.

NikosDi commented 1 week ago

Hello @user7z, @sgwhat

As I wrote above, I have followed a different guide, based on the PDF, which doesn't install or use a conda environment.

It's a PDF from Intel. https://www.intel.com/content/www/us/en/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html

My exact system specifications are:

Windows 11 24H2 (26100.2033) - Intel ARC A380 - Drivers v6079.

The full log file of Ollama server is this:

Running the command line prompt as Nikos (administrator) Command-line.txt

Running the command line prompt as Administrator admin cmd.txt

TIA

user7z commented 1 week ago

@NikosDi you should install Visual Studio 2022 and select the Desktop development with C++ workload, as the guide suggests. Then do ollama serve, open another terminal, and execute ollama run llama3.2:1b to see if it runs. Make sure you followed the guide that @sgwhat mentioned; the PDF is outdated, bro. Always check the GitHub guides, because they get updated often.

NikosDi commented 6 days ago

@user7z

What GPU are you running Leo AI on?

All the guides refer to the A770; mine is an A380. It looks like it might be a hardware incompatibility after all.

sgwhat commented 6 days ago

@NikosDi I think ipex-llm ollama supports the A380; you may follow our GitHub guides to try it.

NikosDi commented 6 days ago

@sgwhat I think I found the problem; it is mentioned in your guide: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md

Number 14 says:

    Native API failed error: On latest version of ipex-llm, you might come across native API failed error with certain models without the -c parameter. Simply adding -c xx would resolve this problem.

Where should I add this -c parameter, and what is -c xx?

sgwhat commented 4 days ago

Hi @NikosDi, -c xx is the size of the prompt context, but I believe ollama does not need -c xx, because it is a parameter used by llama.cpp.
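(For reference, in a plain llama.cpp run the flag sits directly on the command line; just a sketch with a hypothetical model path, and the binary may be called main or llama-cli depending on the build:)

    :: -m points at a local GGUF model (hypothetical path); -c sets the prompt context size in tokens
    llama-cli -m C:\models\llama-3.1-8b-q4_k_m.gguf -p "Hello" -n 32 -c 2048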

Could you please provide the detailed log returned by ollama serve and the script you used to run ollama?

NikosDi commented 3 days ago

Hello @sgwhat

If you have already checked the text files above and they don't cover what you need, please give me some instructions on how to provide the detailed log returned by ollama serve. Where can I find it? What is the name of the file?

In those texts above I have included all the commands I use to run ollama and the response of my Windows environment.

But I can provide them again as a script, along with the detailed log if you assist with instructions.

Thank you.

sgwhat commented 3 days ago

Hi @NikosDi, I have checked your runtime log. Please follow only our GitHub guide to run ipex-llm ollama, and try running it again without executing "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" before you run ollama serve.

NikosDi commented 2 days ago

Hello @sgwhat.

I have already installed Ollama from the official page https://ollama.com/download on two of my PCs (one Windows 11, one Ubuntu 24.04.1 Linux), and Brave Leo AI works like a charm on both of them using the CPU. It has extreme SIMD CPU optimizations, but CPUs are slow for such tasks.

Ollama installer has built-in support for NVIDIA CUDA and AMD ROCm but no Intel GPU support.

For the Intel GPU I followed the guide from June 2024; I don't know if it's already obsolete.

I have already installed Intel's AI Playground on my Win 11 system, the original CPU-only Ollama setup, and the above Python environment for the Intel GPU (Intel oneAPI, Python).

The truth is that it would be a lot more convenient if there were a single installer, like AI Playground or similar, to avoid the manual steps of the Ollama Intel GPU setup, since it's not built in.

I'm not in the mood to follow another guide and install more environments (miniforge, conda, etc.).

If troubleshooting is impossible following Intel's guide from June 2024, then I have to stop here, because we are going in circles.

Maybe it would be useful if you could tell me what I asked before, regarding the Native API failed error.

I'm using Llama 3.1 as the model, and I would like to try adding -c xx and see the results, if you could tell me where to add this parameter.

Thank you.

sgwhat commented 2 days ago

Hi @NikosDi, I have tested running ipex-llm ollama on a Windows 11 Arc A380 laptop and it works fine. Please follow the guide I mentioned before, and set OLLAMA_NUM_PARALLEL=1 before you run ollama serve.

Also, please note: do not call C:\Program Files (x86)\Intel\oneAPI\setvars.bat.
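In other words, roughly this (a minimal sketch, inside the environment prepared per the GitHub guide, i.e. after init-ollama.bat has been run):

    set OLLAMA_NUM_PARALLEL=1
    ollama serve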

NikosDi commented 1 day ago

@sgwhat Oh man! This is insane!

It was just this "set OLLAMA_NUM_PARALLEL=1" parameter that I had to add to the script. Everything works like a charm now.

The utilization of the A380 is more than 90%, sometimes even 99%, and the response speed is unbelievable compared to my Core i7 9700: many times faster.

Also, calling setvars.bat is mandatory for me, otherwise I get multiple missing DLL errors, like the one I posted here. So I call it every time, unless you can find another way. Screenshot 2024-10-30 121242
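For reference, the full script I run now is just the one from before with the extra variable added (paths are specific to my machine):

    python -m venv llm_env
    C:\Windows\System32\llm_env\Scripts\activate.bat
    init-ollama.bat
    :: setvars.bat is still needed on my setup, otherwise the missing DLL errors appear
    "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
    set OLLAMA_NUM_GPU=999
    set OLLAMA_NUM_PARALLEL=1
    set no_proxy=localhost,127.0.0.1
    set ZES_ENABLE_SYSMAN=1
    set SYCL_CACHE_PERSISTENT=1
    ollama serve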

Thank you very much for the effort.

NikosDi commented 1 day ago

@sgwhat Well...

Unfortunately, even using the exact same model (Llama 3.1 - 8B) for CPU and Intel GPU on the exact same page, the results are completely different using Leo AI to summarize the page.

The A380 is ~9x faster than the Core i7 9700, but the results are sometimes almost garbage.

It hallucinates a lot and always gives me very short summaries, compared to the CPU version, which is perfect.

Extremely slow but perfect, in both size and accuracy.

I don't know if there is anything I could change in the parameters to improve the quality of the Intel GPU results, even at the cost of speed.

Thank you.

sgwhat commented 2 hours ago

Hi @NikosDi, based on my tests, I haven't observed any noticeable difference in answer quality when running ipex-llm ollama on GPU or CPU, nor have I encountered issues with poor output on the A380. Could you provide the detailed responses from Leo AI?