intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Can't run Ollama in a Docker container with iGPU on Linux #12363

Open user7z opened 2 weeks ago

user7z commented 2 weeks ago

Here are the container parameters:

export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
podman run -itd \
    --net=host \
    --device=/dev/dri \
    -v /home/user/.ollama:/root/.ollama \
    -e no_proxy=localhost,127.0.0.1 \
    --memory="32G" \
    --name=$CONTAINER_NAME \
    -e DEVICE=iGPU \
    --shm-size="16g" \
    $DOCKER_IMAGE

cd scripts
bash start-ollama.sh
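For reference, a minimal sketch of how I check that the iGPU is actually visible inside the container before starting Ollama (assuming the image ships the standard oneAPI tools such as sycl-ls):

```bash
# Attach to the running container (name from the podman command above).
podman exec -it ipex-llm-inference-cpp-xpu-container bash

# Inside the container: the host's render nodes should show up here;
# an empty listing means --device=/dev/dri did not take effect.
ls -l /dev/dri

# With the oneAPI environment loaded, sycl-ls should list a
# [level_zero:gpu] entry for the Iris Xe iGPU.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```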

source ipex-llm-init --gpu --device $DEVICE
found oneapi in /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( /usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( /usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( /usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( root@lp:/llm/scripts# bash start-ollama.sh root@lp:/llm/scripts# 2024/11/08 00:35:57 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:753 msg="total blobs: 6" time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" time=2024-11-08T00:35:57.379+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6-ipexllm-20241106)" time=2024-11-08T00:35:57.380+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama272927415/runners time=2024-11-08T00:35:57.504+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cpu cpu_avx]" time=2024-11-08T00:36:09.351+08:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs" time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries" time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries" time=2024-11-08T00:36:09.357+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries" time=2024-11-08T00:36:09.360+08:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered" time=2024-11-08T00:36:09.378+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=999 layers.model=31 layers.offload=0 layers.split="" memory.available="[26.2 GiB]" memory.required.full="434.7 MiB" memory.required.partial="0 B" memory.required.kv="180.0 MiB" memory.required.allocations="[434.7 MiB]" memory.weights.total="233.7 MiB" memory.weights.repeating="205.0 MiB" memory.weights.nonrepeating="28.7 MiB" memory.graph.full="164.5 MiB" memory.graph.partial="168.4 MiB" time=2024-11-08T00:36:09.379+08:00 level=INFO source=server.go:395 msg="starting llama server" 
cmd="/tmp/ollama272927415/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 36063" time=2024-11-08T00:36:09.380+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding" time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error" llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Smollm2 135M 8k Lc100K Mix1 Ep2 llama_model_loader: - kv 3: general.organization str = HuggingFaceTB llama_model_loader: - kv 4: general.finetune str = 8k-lc100k-mix1-ep2 llama_model_loader: - kv 5: general.basename str = smollm2 llama_model_loader: - kv 6: general.size_label str = 135M llama_model_loader: - kv 7: general.license str = apache-2.0 llama_model_loader: - kv 8: general.languages arr[str,1] = ["en"] llama_model_loader: - kv 9: llama.block_count u32 = 30 llama_model_loader: - kv 10: llama.context_length u32 = 8192 llama_model_loader: - kv 11: llama.embedding_length u32 = 576 llama_model_loader: - kv 12: llama.feed_forward_length u32 = 1536 llama_model_loader: - kv 13: llama.attention.head_count u32 = 9 llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 3 llama_model_loader: - kv 15: llama.rope.freq_base f32 = 100000.000000 llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 17: general.file_type u32 = 10 llama_model_loader: - kv 18: llama.vocab_size u32 = 49152 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 64 llama_model_loader: - kv 20: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 23: tokenizer.ggml.pre str = smollm llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,49152] = ["< endoftext >", "< im_start >", "< ... llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ... llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 2 llama_model_loader: - kv 31: tokenizer.chat_template str = {% for message in messages %}{% if lo... 
llama_model_loader: - kv 32: general.quantization_version u32 = 2 llama_model_loader: - type f32: 61 tensors llama_model_loader: - type q8_0: 1 tensors llama_model_loader: - type q3_K: 30 tensors llama_model_loader: - type iq4_nl: 180 tensors llm_load_vocab: special tokens cache size = 17 llm_load_vocab: token to piece cache size = 0.3170 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 49152 llm_load_print_meta: n_merges = 48900 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 576 llm_load_print_meta: n_layer = 30 llm_load_print_meta: n_head = 9 llm_load_print_meta: n_head_kv = 3 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 64 llm_load_print_meta: n_embd_head_v = 64 llm_load_print_meta: n_gqa = 3 llm_load_print_meta: n_embd_k_gqa = 192 llm_load_print_meta: n_embd_v_gqa = 192 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 1536 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 100000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q2_K - Medium llm_load_print_meta: model params = 134.52 M llm_load_print_meta: model size = 82.41 MiB (5.14 BPW) llm_load_print_meta: general.name = Smollm2 135M 8k Lc100K Mix1 Ep2 llm_load_print_meta: BOS token = 1 '< im_start >' llm_load_print_meta: EOS token = 2 '< im_end >' llm_load_print_meta: UNK token = 0 '< endoftext >' llm_load_print_meta: PAD token = 2 '< im_end >' llm_load_print_meta: LF token = 143 'Ä' llm_load_print_meta: EOT token = 0 '< endoftext >' llm_load_print_meta: EOG token = 0 '< endoftext >' llm_load_print_meta: EOG token = 2 '< im_end >' llm_load_print_meta: max token length = 162 time=2024-11-08T00:36:09.632+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model" ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 1 SYCL devices: llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: offloading 30 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 31/31 layers to GPU llm_load_tensors: SYCL0 buffer size = 82.46 MiB llm_load_tensors: SYCL_Host buffer size = 28.69 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 100000.0 llama_new_context_with_model: freq_scale = 1 [SYCL] call ggml_check_sycl ggml_check_sycl: GGML_SYCL_DEBUG: 0 ggml_check_sycl: GGML_SYCL_F16: no found 1 SYCL devices: Max Max Global compute Max work sub mem ID Device Type Name 
  ID  Device Type          Name                       Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
   0  [level_zero:gpu:0]   Intel Graphics [0x46a8]    1.3      80                 512             32             26651M           1.3.26241

llama_kv_cache_init: SYCL0 KV buffer size = 180.00 MiB llama_new_context_with_model: KV self size = 180.00 MiB, K (f16): 90.00 MiB, V (f16): 90.00 MiB llama_new_context_with_model: SYCL_Host output buffer size = 0.76 MiB llama_new_context_with_model: SYCL0 compute buffer size = 97.12 MiB llama_new_context_with_model: SYCL_Host compute buffer size = 17.13 MiB llama_new_context_with_model: graph nodes = 846 llama_new_context_with_model: graph splits = 2 time=2024-11-08T00:36:15.414+08:00 level=INFO source=server.go:634 msg="llama runner started in 6.03 seconds" ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 , fp16 , fp16 , fp16 , fp16 , float , size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
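For completeness, the crash can be triggered with a single request against the Ollama HTTP API once the runner has started; a sketch (the model tag is just the one I had pulled, yours may differ):

```bash
# The server from start-ollama.sh listens on 127.0.0.1:11434 (see log above).
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "smollm2:135m",
  "prompt": "Hello, how are you?",
  "stream": false
}'
# With the affected models, the ollama_llama_server process aborts with the
# sdp_xmx_kernel.cpp assertion instead of returning a completion.
```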

sgwhat commented 2 weeks ago

Hi @user7z, could you provide your device configuration information?
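For example, the output of something like the following would help (a rough sketch; any equivalent tool is fine):

```bash
# CPU model and kernel version on the host
lscpu | grep "Model name"
uname -r

# GPU as seen by the compute runtimes inside the container
source /opt/intel/oneapi/setvars.sh
sycl-ls
clinfo | grep -i "device name"

# Render nodes passed through to the container
ls -l /dev/dri
```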

user7z commented 2 weeks ago

@sgwhat It's an i5-1235U (Alder Lake) with an Iris Xe integrated graphics card. I got it to work for llama3.2, but it didn't work with, for example, smollm2. For llama there is a bad accuracy regression: try to chat with it, or just say hello, and you'll see. And when it is used within open-webui, it fails directly.

user7z commented 2 weeks ago

@sgwhat gemma2 is the only one that works, and it does poorly. phi3.5 at least launches; qwen2.5, the mistral models, and llama3.2 do not work. One of the mistral models responded to my first message, but after that I get the Assertion `false' failed error. I only experience this with this Docker image. Also, the official open-webui container works great, so I think there is no need to bloat the gigantic Docker image with it; it would be great if you provided one that just has a working Ollama, without all the bloat. That bloat might also be what causes the poor performance with gemma2.

sgwhat commented 2 weeks ago

Which oneAPI version have you installed in your container?

user7z commented 2 weeks ago

@sgwhat It's a container; it comes with oneAPI. The version is the one you support under Linux.
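If it helps, a quick sketch of how the bundled oneAPI version can be read out from inside the container (standard oneAPI tooling, nothing image-specific):

```bash
# Load the oneAPI environment that ships with the image.
source /opt/intel/oneapi/setvars.sh

# Compiler / runtime versions bundled in the image.
icpx --version
sycl-ls

# The component directories under /opt/intel/oneapi are also versioned.
ls /opt/intel/oneapi/compiler/
```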

hzjane commented 2 weeks ago

I can't reproduce the Assertion `false' failed error; maybe you could provide more information about how to reproduce it. And I do see the incorrect output issue even outside the Docker image; we will fix it later.

user7z commented 2 weeks ago

@hzjane To reproduce:

Image: docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest

1. Run the container and go inside it.
2. cd scripts and bash start-ollama.sh
3. Open another terminal and do the same, but instead of running Ollama, run bash start-openwebui.sh
4. Go to Open WebUI in your browser and try these models: smollm2 (didn't work at all), Llama 3.2 (works for a few chats, one or two), Mistral (same thing), Qwen2.5 (same). Those are the models I tested; I also tested Gemma2, and it did work.

You will notice a regression in accuracy and a performance hit compared to the local setup. This was tested on an up-to-date Linux system with the Iris Xe integrated GPU found in Intel CPUs; mine is an i5-1235U.
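Condensed into commands, the repro looks roughly like this (a sketch; the container name is the one from the podman command at the top of the issue, and the script names are the ones in /llm/scripts):

```bash
# Terminal 1: start Ollama inside the already-running container.
podman exec -it ipex-llm-inference-cpp-xpu-container \
    bash -c "cd /llm/scripts && bash start-ollama.sh"

# Terminal 2: start Open WebUI the same way.
podman exec -it ipex-llm-inference-cpp-xpu-container \
    bash -c "cd /llm/scripts && bash start-openwebui.sh"

# Then open Open WebUI in the browser, pull smollm2 / llama3.2 / mistral /
# qwen2.5 / gemma2, and chat with each. Only gemma2 keeps working; the
# others hit the Assertion `false' failed crash after a message or two.
```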

hzjane commented 1 week ago

The issue where smollm2 or gemma2 didn't work is fixed by pr-12386. The output accuracy issue is still being fixed by @sgwhat.