user7z opened 2 weeks ago
Hi @user7z, could you provide your device configuration information?
@sgwhat it's an i5-1235U (Alder Lake) with an Iris Xe integrated GPU. I got it to work with llama3.2, but it didn't work with smollm2, for example. Even llama3.2 has a bad accuracy regression; try to chat with it, or just say "hello" or "hi", and you'll see. And when it is used from the bundled open-webui, it fails directly.
@sgwhat gemma2 is the only one that works, and it does poorly. phi3.5 at least launches; qwen2.5, the mistral models, and llama3.2 do not work. One of the mistral models responded to my first message, but after that I get the Assertion `false' failed error. I only experience this with this docker image. Also, the official open-webui container works great, so I think there is no need to bloat the gigantic docker image with it; it would be great if you provided one that just has a working ollama, without all the bloat, which might be what causes the poor performance with gemma2.
Which oneAPI version have you installed in your container?
@sgwhat it's your container and it comes with oneAPI; the version is the one you support under Linux.
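(If it helps to pin down the exact version, here is a minimal check that can be run inside the container. It is only a sketch, assuming the standard /opt/intel/oneapi layout that appears in the start-up log further down.)

```bash
# Inside the running container: load the oneAPI environment, then report
# the compiler version and the SYCL devices that are visible.
source /opt/intel/oneapi/setvars.sh --force
icpx --version   # prints the oneAPI DPC++/C++ compiler version shipped in the image
sycl-ls          # should list the Iris Xe iGPU as a Level Zero GPU device
```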
I can't reproduce the Assertion `false' failed error; could you provide more information about how to reproduce it? I do hit the Incorrect output issue even outside the docker image; we will fix it later.
@hzjane to reproduce, with the image docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest:

1. Run the container and open a shell inside it (the exact commands are sketched below).
2. cd scripts and run bash start-ollama.sh.
3. Open another terminal, enter the container the same way, but instead of starting ollama run bash start-openwebui.sh.
4. Open open-webui in your browser and try these models: smollm2 (didn't work at all), llama3.2 (works for one or two chats), mistral (same thing), and qwen2.5. I also tested gemma2, and it did work.

You will notice a regression in accuracy and a performance hit compared to the local setup. This was tested on an up-to-date Linux system with the Iris Xe integrated GPU found in Intel CPUs; mine is an i5-1235U.
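Spelled out, the two-terminal procedure looks roughly like this. It is a sketch only: it assumes the container name from the parameters pasted below, and that the scripts directory sits in the default working directory of a shell opened inside the container, as in the steps above.

```bash
# Terminal 1: start the Ollama backend inside the running container
podman exec -it ipex-llm-inference-cpp-xpu-container bash -c "cd scripts && bash start-ollama.sh"

# Terminal 2: start open-webui the same way, then browse to it
podman exec -it ipex-llm-inference-cpp-xpu-container bash -c "cd scripts && bash start-openwebui.sh"
```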
Here are the container parameters:
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container

podman run -itd \
    --net=host \
    --device=/dev/dri \
    -v /home/user/.ollama:/root/.ollama \
    -e no_proxy=localhost,127.0.0.1 \
    --memory="32G" \
    --name=$CONTAINER_NAME \
    -e DEVICE=iGPU \
    --shm-size="16g" \
    $DOCKER_IMAGE

cd scripts
bash start-ollama.sh
```
```
source ipex-llm-init --gpu --device $DEVICE
found oneapi in /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

llama_kv_cache_init: SYCL0 KV buffer size = 180.00 MiB
llama_new_context_with_model: KV self size = 180.00 MiB, K (f16): 90.00 MiB, V (f16): 90.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.76 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 97.12 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 17.13 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 2
time=2024-11-08T00:36:15.414+08:00 level=INFO source=server.go:634 msg="llama runner started in 6.03 seconds"
ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
```