intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Can't run Ollama in a Docker container with iGPU on Linux #12363

Open user7z opened 2 weeks ago

user7z commented 2 weeks ago

Here are the container parameters:

export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
podman run -itd \
    --net=host \
    --device=/dev/dri \
    -v /home/user/.ollama:/root/.ollama \
    -e no_proxy=localhost,127.0.0.1 \
    --memory="32G" \
    --name=$CONTAINER_NAME \
    -e DEVICE=iGPU \
    --shm-size="16g" \
    $DOCKER_IMAGE

cd scripts
bash start-ollama.sh
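For reference, a minimal sketch of how I check that the iGPU is actually visible inside the container before starting Ollama (assuming the image ships the standard oneAPI tools such as sycl-ls):

```bash
# Attach to the running container (name from the podman command above).
podman exec -it ipex-llm-inference-cpp-xpu-container bash

# Inside the container: the host's render nodes should show up here;
# an empty listing means --device=/dev/dri did not take effect.
ls -l /dev/dri

# With the oneAPI environment loaded, sycl-ls should list a
# [level_zero:gpu] entry for the Iris Xe iGPU.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```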

source ipex-llm-init --gpu --device $DEVICE
found oneapi in /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( /usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( /usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( /usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( root@lp:/llm/scripts# bash start-ollama.sh root@lp:/llm/scripts# 2024/11/08 00:35:57 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:753 msg="total blobs: 6" time=2024-11-08T00:35:57.378+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" time=2024-11-08T00:35:57.379+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6-ipexllm-20241106)" time=2024-11-08T00:35:57.380+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama272927415/runners time=2024-11-08T00:35:57.504+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cpu cpu_avx]" time=2024-11-08T00:36:09.351+08:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs" time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries" time=2024-11-08T00:36:09.351+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries" time=2024-11-08T00:36:09.357+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries" time=2024-11-08T00:36:09.360+08:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered" time=2024-11-08T00:36:09.378+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=999 layers.model=31 layers.offload=0 layers.split="" memory.available="[26.2 GiB]" memory.required.full="434.7 MiB" memory.required.partial="0 B" memory.required.kv="180.0 MiB" memory.required.allocations="[434.7 MiB]" memory.weights.total="233.7 MiB" memory.weights.repeating="205.0 MiB" memory.weights.nonrepeating="28.7 MiB" memory.graph.full="164.5 MiB" memory.graph.partial="168.4 MiB" time=2024-11-08T00:36:09.379+08:00 level=INFO source=server.go:395 msg="starting llama server" 
cmd="/tmp/ollama272927415/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 36063" time=2024-11-08T00:36:09.380+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding" time=2024-11-08T00:36:09.380+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error" llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from /root/.ollama/models/blobs/sha256-55aa88ddac43adce6af0e9be8d6cdff2337a3835cd9b50bbcd7a894eb66dfc75 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Smollm2 135M 8k Lc100K Mix1 Ep2 llama_model_loader: - kv 3: general.organization str = HuggingFaceTB llama_model_loader: - kv 4: general.finetune str = 8k-lc100k-mix1-ep2 llama_model_loader: - kv 5: general.basename str = smollm2 llama_model_loader: - kv 6: general.size_label str = 135M llama_model_loader: - kv 7: general.license str = apache-2.0 llama_model_loader: - kv 8: general.languages arr[str,1] = ["en"] llama_model_loader: - kv 9: llama.block_count u32 = 30 llama_model_loader: - kv 10: llama.context_length u32 = 8192 llama_model_loader: - kv 11: llama.embedding_length u32 = 576 llama_model_loader: - kv 12: llama.feed_forward_length u32 = 1536 llama_model_loader: - kv 13: llama.attention.head_count u32 = 9 llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 3 llama_model_loader: - kv 15: llama.rope.freq_base f32 = 100000.000000 llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 17: general.file_type u32 = 10 llama_model_loader: - kv 18: llama.vocab_size u32 = 49152 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 64 llama_model_loader: - kv 20: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 23: tokenizer.ggml.pre str = smollm llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,49152] = ["< endoftext >", "< im_start >", "< ... llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ... llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 2 llama_model_loader: - kv 31: tokenizer.chat_template str = {% for message in messages %}{% if lo... 
llama_model_loader: - kv 32: general.quantization_version u32 = 2 llama_model_loader: - type f32: 61 tensors llama_model_loader: - type q8_0: 1 tensors llama_model_loader: - type q3_K: 30 tensors llama_model_loader: - type iq4_nl: 180 tensors llm_load_vocab: special tokens cache size = 17 llm_load_vocab: token to piece cache size = 0.3170 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 49152 llm_load_print_meta: n_merges = 48900 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 576 llm_load_print_meta: n_layer = 30 llm_load_print_meta: n_head = 9 llm_load_print_meta: n_head_kv = 3 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 64 llm_load_print_meta: n_embd_head_v = 64 llm_load_print_meta: n_gqa = 3 llm_load_print_meta: n_embd_k_gqa = 192 llm_load_print_meta: n_embd_v_gqa = 192 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 1536 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 100000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q2_K - Medium llm_load_print_meta: model params = 134.52 M llm_load_print_meta: model size = 82.41 MiB (5.14 BPW) llm_load_print_meta: general.name = Smollm2 135M 8k Lc100K Mix1 Ep2 llm_load_print_meta: BOS token = 1 '< im_start >' llm_load_print_meta: EOS token = 2 '< im_end >' llm_load_print_meta: UNK token = 0 '< endoftext >' llm_load_print_meta: PAD token = 2 '< im_end >' llm_load_print_meta: LF token = 143 'Ä' llm_load_print_meta: EOT token = 0 '< endoftext >' llm_load_print_meta: EOG token = 0 '< endoftext >' llm_load_print_meta: EOG token = 2 '< im_end >' llm_load_print_meta: max token length = 162 time=2024-11-08T00:36:09.632+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model" ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 1 SYCL devices: llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: offloading 30 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 31/31 layers to GPU llm_load_tensors: SYCL0 buffer size = 82.46 MiB llm_load_tensors: SYCL_Host buffer size = 28.69 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 100000.0 llama_new_context_with_model: freq_scale = 1 [SYCL] call ggml_check_sycl ggml_check_sycl: GGML_SYCL_DEBUG: 0 ggml_check_sycl: GGML_SYCL_F16: no found 1 SYCL devices: Max Max Global compute Max work sub mem ID Device Type Name 
  ID  Device Type          Name                       Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
   0  [level_zero:gpu:0]   Intel Graphics [0x46a8]    1.3      80                 512             32             26651M           1.3.26241

llama_kv_cache_init: SYCL0 KV buffer size = 180.00 MiB llama_new_context_with_model: KV self size = 180.00 MiB, K (f16): 90.00 MiB, V (f16): 90.00 MiB llama_new_context_with_model: SYCL_Host output buffer size = 0.76 MiB llama_new_context_with_model: SYCL0 compute buffer size = 97.12 MiB llama_new_context_with_model: SYCL_Host compute buffer size = 17.13 MiB llama_new_context_with_model: graph nodes = 846 llama_new_context_with_model: graph splits = 2 time=2024-11-08T00:36:15.414+08:00 level=INFO source=server.go:634 msg="llama runner started in 6.03 seconds" ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 , fp16 , fp16 , fp16 , fp16 , float , size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
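For completeness, the crash can be triggered with a single request against the Ollama HTTP API once the runner has started; a sketch (the model tag is just the one I had pulled, yours may differ):

```bash
# The server from start-ollama.sh listens on 127.0.0.1:11434 (see log above).
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "smollm2:135m",
  "prompt": "Hello, how are you?",
  "stream": false
}'
# With the affected models, the ollama_llama_server process aborts with the
# sdp_xmx_kernel.cpp assertion instead of returning a completion.
```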

sgwhat commented 2 weeks ago

Hi @user7z, could you provide your device configuration information?
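For example, the output of something like the following would help (a rough sketch; any equivalent tool is fine):

```bash
# CPU model and kernel version on the host
lscpu | grep "Model name"
uname -r

# GPU as seen by the compute runtimes inside the container
source /opt/intel/oneapi/setvars.sh
sycl-ls
clinfo | grep -i "device name"

# Render nodes passed through to the container
ls -l /dev/dri
```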

user7z commented 2 weeks ago

@sgwhat It's an i5-1235U (Alder Lake) with an Iris Xe integrated graphics card. I got it to work for llama3.2, but it didn't work with, for example, smollm2. For llama there is a bad accuracy regression: try to chat with it, or just say hello, and you'll see. And when it is used within open-webui, it fails directly.

user7z commented 2 weeks ago

@sgwhat gemma2 is the only one that works, and it does poorly. phi3.5 at least launches; qwen2.5, the mistral models, and llama3.2 do not work. One of the mistral models responded to my first message, but after that I get the Assertion `false' failed error. I only experience this with this Docker image. Also, the official open-webui container works great, so I think there is no need to bloat the gigantic Docker image with it; it would be great if you provided one that just has a working Ollama, without all the bloat. That bloat might also be what causes the poor performance with gemma2.

sgwhat commented 2 weeks ago

Which oneAPI version have you installed in your container?

user7z commented 2 weeks ago

@sgwhat It's a container; it comes with oneAPI. The version is the one you support under Linux.
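If it helps, a quick sketch of how the bundled oneAPI version can be read out from inside the container (standard oneAPI tooling, nothing image-specific):

```bash
# Load the oneAPI environment that ships with the image.
source /opt/intel/oneapi/setvars.sh

# Compiler / runtime versions bundled in the image.
icpx --version
sycl-ls

# The component directories under /opt/intel/oneapi are also versioned.
ls /opt/intel/oneapi/compiler/
```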

hzjane commented 2 weeks ago

I can't reproduce the Assertion `false' failed error; maybe you could provide more information about how to reproduce it. And I do see the incorrect output issue even outside the Docker image; we will fix it later.

user7z commented 2 weeks ago

@hzjane To reproduce:

Image: docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest

1. Run the container and go inside it.
2. cd scripts and bash start-ollama.sh
3. Open another terminal and do the same, but instead of running Ollama, run bash start-openwebui.sh
4. Go to Open WebUI in your browser and try these models: smollm2 (didn't work at all), Llama 3.2 (works for a few chats, one or two), Mistral (same thing), Qwen2.5 (same). Those are the models I tested; I also tested Gemma2, and it did work.

You will notice a regression in accuracy and a performance hit compared to the local setup. This was tested on an up-to-date Linux system with the Iris Xe integrated GPU found in Intel CPUs; mine is an i5-1235U.
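Condensed into commands, the repro looks roughly like this (a sketch; the container name is the one from the podman command at the top of the issue, and the script names are the ones in /llm/scripts):

```bash
# Terminal 1: start Ollama inside the already-running container.
podman exec -it ipex-llm-inference-cpp-xpu-container \
    bash -c "cd /llm/scripts && bash start-ollama.sh"

# Terminal 2: start Open WebUI the same way.
podman exec -it ipex-llm-inference-cpp-xpu-container \
    bash -c "cd /llm/scripts && bash start-openwebui.sh"

# Then open Open WebUI in the browser, pull smollm2 / llama3.2 / mistral /
# qwen2.5 / gemma2, and chat with each. Only gemma2 keeps working; the
# others hit the Assertion `false' failed crash after a message or two.
```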

hzjane commented 1 week ago

The issue where smollm2 or gemma2 didn't work is fixed by pr-12386. The output accuracy issue is still being fixed by @sgwhat.