intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ollama MTL iGPU issue #11425

Open kylinzhao90 opened 4 days ago

kylinzhao90 commented 4 days ago

I hit this issue while using Ollama on an MTL iGPU (error screenshot attached). My IPEX-LLM version and iGPU info are shown in the attached screenshots below.

rnwang04 commented 3 days ago

Hi @kylinzhao90, for Linux users we don't recommend using the pip-installed oneAPI. Please try again with conda deactivate & source /opt/intel/oneapi/setvars.sh (assuming you have followed our guide to install oneAPI via apt: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-oneapi). If that doesn't solve your issue, could you please share your detailed env info from the env check script?
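Roughly, the sequence we mean is the following sketch (it assumes the apt oneAPI install under /opt/intel/oneapi and that the env check script is saved locally as env_check.sh):

# leave the conda env so that any pip-installed oneAPI libraries are not picked up
conda deactivate
# load the apt-installed oneAPI environment instead
source /opt/intel/oneapi/setvars.sh
# start ollama again against the system oneAPI
./ollama serve
# if it still fails, collect detailed environment info to share here
bash env_check.sh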

kylinzhao90 commented 3 days ago

(screenshot attached)

These are the outputs of env_check.sh:

(llm-cpp) root@O-E-M:~# bash env_check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240624
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             22
On-line CPU(s) list:                0-21
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) Ultra 7 155H
CPU family:                         6
Model:                              170
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           4
CPU max MHz:                        4800.0000
CPU min MHz:                        400.0000
BogoMIPS:                           5990.40
-----------------------------------------------------------------
Total CPU Memory: 62.4737 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.4 LTS \n \l

-----------------------------------------------------------------
Linux O-E-M 6.5.0-18-generic #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb  7 11:40:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.35.20240425
    Build ID: 00000000

Service:
    Version: 1.2.35.20240425
    Build ID: 00000000
    Level Zero Version: 1.16.0
-----------------------------------------------------------------
env_check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii  intel-level-zero-gpu                       1.3.29138.29-881~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
env_check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-00087d558086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
        DeviceName: Onboard - Video
        Subsystem: Intel Corporation Device 2212
        Flags: bus master, fast devsel, latency 0, IRQ 175, IOMMU group 0
        Memory at 4810000000 (64-bit, prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
-----------------------------------------------------------------
(llm-cpp) root@O-E-M:~#

kylinzhao90 commented 3 days ago

I reinstalled oneAPI via apt, and it hits the same error:

(base) root@O-E-M:~# source /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

(base) root@O-E-M:~# conda activate llm-cpp
(llm-cpp) root@O-E-M:~# cd ollama/
(llm-cpp) root@O-E-M:~/ollama# ls
ollama
(llm-cpp) root@O-E-M:~/ollama# ./ollama serve
2024/06/26 11:13:12 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-26T11:13:12.390+08:00 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-06-26T11:13:12.391+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-26T11:13:12.392+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-26T11:13:12.392+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2351670096/runners
time=2024-06-26T11:13:12.457+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu]"
[GIN] 2024/06/26 - 11:13:49 | 200 |      48.476µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/26 - 11:13:49 | 200 |     596.527µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/26 - 11:13:49 | 200 |     306.471µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-26T11:13:49.344+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T11:13:49.344+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2351670096/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 38005"
time=2024-06-26T11:13:49.345+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T11:13:49.345+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T11:13:49.345+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140284799387648" timestamp=1719371629
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140284799387648" timestamp=1719371629 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="38005" tid="140284799387648" timestamp=1719371629
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 62583M|            1.3.29138|
| 1|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2228.82 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
time=2024-06-26T11:13:49.597+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.13 MiB
[1719371630] warming up the model with an empty run
llama_new_context_with_model:      SYCL0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 53767]
[New LWP 53768]
[New LWP 53769]
[New LWP 53770]
[New LWP 53771]
[New LWP 53772]
[New LWP 53773]
[New LWP 53774]
[New LWP 53775]
[New LWP 53776]
[New LWP 53777]
[New LWP 53778]
[New LWP 53779]
[New LWP 53780]
[New LWP 53781]
[New LWP 53782]
[New LWP 53783]
[New LWP 53784]
[New LWP 53785]
[New LWP 53786]
[New LWP 53787]
[New LWP 53788]
[New LWP 53789]
[New LWP 53790]
[New LWP 53791]
[New LWP 53792]
[New LWP 53793]
[New LWP 53794]
[New LWP 53795]
[New LWP 53796]
[New LWP 53797]
[New LWP 53798]
[New LWP 53799]
[New LWP 53800]
[New LWP 53801]
[New LWP 53802]
[New LWP 53803]
[New LWP 53804]
[New LWP 53805]
[New LWP 53806]
[New LWP 53807]
[New LWP 53808]
[New LWP 53809]
[New LWP 53810]
[New LWP 53811]
[New LWP 53812]
[New LWP 53813]
[New LWP 53814]
[New LWP 53815]
[New LWP 53816]
[New LWP 53817]
[New LWP 53818]
[New LWP 53819]
[New LWP 53820]
time=2024-06-26T11:13:51.054+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f968c2ea42f in __GI___wait4 (pid=53821, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f968c2ea42f in __GI___wait4 (pid=53821, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00000000006b9426 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2  0x00000000006b4e67 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3  0x000000000077006f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4  0x0000000000673a48 in ggml_backend_sched_graph_compute_async ()
#5  0x00000000005820db in llama_decode ()
#6  0x00000000005066b7 in llama_init_from_gpt_params(gpt_params&) ()
#7  0x000000000043db28 in main ()
[Inferior 1 (process 53766) detached]
time=2024-06-26T11:13:52.214+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
time=2024-06-26T11:13:52.488+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-06-26T11:13:52.739+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/06/26 - 11:13:52 | 500 |  3.522210885s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/06/26 - 11:15:27 | 200 |      31.206µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/26 - 11:15:27 | 200 |     417.158µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/26 - 11:15:27 | 200 |     375.924µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-26T11:15:27.897+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T11:15:27.897+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2351670096/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 34589"
time=2024-06-26T11:15:27.898+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T11:15:27.898+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T11:15:27.898+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140477976528896" timestamp=1719371727
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140477976528896" timestamp=1719371727 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="34589" tid="140477976528896" timestamp=1719371727
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 62583M|            1.3.29138|
| 1|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2228.82 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
time=2024-06-26T11:15:28.150+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.13 MiB
[1719371729] warming up the model with an empty run
llama_new_context_with_model:      SYCL0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

rnwang04 commented 3 days ago

Hi @kylinzhao90, does your llm-cpp env have any onednn / onemkl related packages? If so, don't conda activate llm-cpp; stay in your base conda env (which doesn't have the pip-installed oneAPI). Try conda deactivate and then run ./ollama serve.
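As a rough sketch (assuming the llm-cpp env name and the apt oneAPI setup from above; this is a suggestion to try, not a confirmed fix):

# check whether pip-installed oneAPI components exist inside llm-cpp
conda activate llm-cpp
pip list | grep -Ei 'onednn|onemkl'
# if they show up, go back to the base env and rely on the apt oneAPI instead
conda deactivate
source /opt/intel/oneapi/setvars.sh
./ollama serve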

kylinzhao90 commented 3 days ago

llm-cpp does have onednn / onemkl related packages.

I have removed the LD_LIBRARY_PATH config from the env:

(base) root@O-E-M:~/ollama# conda env config vars list -n llm-cpp
(base) root@O-E-M:~/ollama#
(llm-cpp) root@O-E-M:~/ollama# pip list |grep onednn
onednn                   2024.0.0
(llm-cpp) root@O-E-M:~/ollama# pip list |grep onemkl
onemkl-sycl-blas         2024.0.0
onemkl-sycl-datafitting  2024.0.0
onemkl-sycl-dft          2024.0.0
onemkl-sycl-lapack       2024.0.0
onemkl-sycl-rng          2024.0.0
onemkl-sycl-sparse       2024.0.0
onemkl-sycl-stats        2024.0.0
onemkl-sycl-vm           2024.0.0
(llm-cpp) root@O-E-M:~/ollama#

and I still hit the same issue:

(base) root@O-E-M:~# source /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

(base) root@O-E-M:~# cd ollama/
(base) root@O-E-M:~/ollama# ./ollama serve
2024/06/26 12:34:06 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-26T12:34:06.129+08:00 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-06-26T12:34:06.129+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-26T12:34:06.130+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-26T12:34:06.130+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3082295135/runners
time=2024-06-26T12:34:06.191+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
[GIN] 2024/06/26 - 12:34:09 | 200 |      52.664µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/26 - 12:34:09 | 200 |     594.907µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/26 - 12:34:09 | 200 |     317.584µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-26T12:34:09.320+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T12:34:09.320+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama3082295135/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 43809"
time=2024-06-26T12:34:09.321+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T12:34:09.321+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T12:34:09.321+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140337736312832" timestamp=1719376449
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140337736312832" timestamp=1719376449 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="43809" tid="140337736312832" timestamp=1719376449
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 62583M|            1.3.29138|
| 1|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2228.82 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
time=2024-06-26T12:34:09.573+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.13 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
[1719376450] warming up the model with an empty run
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 56060]
[New LWP 56061]
[New LWP 56062]
[New LWP 56063]
[New LWP 56064]
[New LWP 56065]
[New LWP 56066]
[New LWP 56067]
[New LWP 56068]
[New LWP 56069]
[New LWP 56070]
[New LWP 56071]
[New LWP 56072]
[New LWP 56073]
[New LWP 56074]
[New LWP 56075]
[New LWP 56076]
[New LWP 56077]
[New LWP 56078]
[New LWP 56079]
[New LWP 56080]
[New LWP 56081]
[New LWP 56082]
[New LWP 56083]
[New LWP 56084]
[New LWP 56085]
[New LWP 56086]
[New LWP 56087]
[New LWP 56088]
[New LWP 56089]
[New LWP 56090]
[New LWP 56091]
[New LWP 56092]
[New LWP 56093]
[New LWP 56094]
[New LWP 56095]
[New LWP 56096]
[New LWP 56097]
[New LWP 56098]
[New LWP 56099]
[New LWP 56100]
[New LWP 56101]
[New LWP 56102]
[New LWP 56103]
[New LWP 56104]
[New LWP 56105]
[New LWP 56106]
[New LWP 56107]
[New LWP 56108]
[New LWP 56109]
[New LWP 56110]
[New LWP 56111]
[New LWP 56112]
[New LWP 56113]
time=2024-06-26T12:34:11.028+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa2df6ea42f in __GI___wait4 (pid=56114, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007fa2df6ea42f in __GI___wait4 (pid=56114, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00000000006b9426 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2  0x00000000006b4e67 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3  0x000000000077006f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4  0x0000000000673a48 in ggml_backend_sched_graph_compute_async ()
#5  0x00000000005820db in llama_decode ()
#6  0x00000000005066b7 in llama_init_from_gpt_params(gpt_params&) ()
#7  0x000000000043db28 in main ()
[Inferior 1 (process 56059) detached]
time=2024-06-26T12:34:12.182+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
time=2024-06-26T12:34:12.433+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/06/26 - 12:34:12 | 500 |  3.236072737s |       127.0.0.1 | POST     "/api/chat"

kylinzhao90 commented 2 days ago

@rnwang04 any comments on this issue?

rnwang04 commented 2 days ago

Sadly, we can't reproduce this issue on our Linux MTL machine (Intel(R) Core(TM) Ultra 5 125H); see the attached screenshots.

Regarding this error, the only related issue we have encountered is https://github.com/intel-analytics/ipex-llm/issues/10845, but it seems that doesn't work in your case.

Here is our env info, pasted here for your reference:

-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240626
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             18
On-line CPU(s) list:                0-17
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) Ultra 5 125H
CPU family:                         6
Model:                              170
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          1
Stepping:                           4
CPU max MHz:                        4500.0000
CPU min MHz:                        400.0000
BogoMIPS:                           5990.40
-----------------------------------------------------------------
Total CPU Memory: 30.9502 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 22.04.3 LTS \n \l

-----------------------------------------------------------------
Linux xiaoxin04-ubuntu 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.22.20231126
    Build ID: 00000000

Service:
    Version: 1.2.22.20231126
    Build ID: 00000000
    Level Zero Version: 1.14.0
-----------------------------------------------------------------
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2024.17.3.0.08_160000
  Driver UUID                                     32342e30-392e-3238-3731-372e31320000
  Driver Version                                  24.09.28717.12
  Driver Version                                  2024.17.3.0.08_160000
-----------------------------------------------------------------
Driver related package version:
ii  intel-fw-gpu                                   2023.39.2-255~22.04                     all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu                           1.3.28717.12                            amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                                 1.14.0-744~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu detected
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-00087d558086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=256M
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 3cc9
        Flags: bus master, fast devsel, latency 0, IRQ 184, IOMMU group 0
        Memory at 408c000000 (64-bit, prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
-----------------------------------------------------------------