intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ollama MTL iGPU issue #11425

Open kylinzhao90 opened 4 days ago

kylinzhao90 commented 4 days ago

I hit this issue while using Ollama on an MTL iGPU (error screenshot attached). My IPEX-LLM version and iGPU info are shown in the attached screenshots below.

rnwang04 commented 3 days ago

Hi @kylinzhao90, for Linux users we don't recommend using the pip-installed oneAPI. Please try again with conda deactivate & source /opt/intel/oneapi/setvars.sh (assuming you have followed our guide to install oneAPI via apt: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-oneapi). If that doesn't solve your issue, could you please share your detailed env info from the env check script?
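Roughly, the sequence we mean is the following sketch (it assumes the apt oneAPI install under /opt/intel/oneapi and that the env check script is saved locally as env_check.sh):

# leave the conda env so that any pip-installed oneAPI libraries are not picked up
conda deactivate
# load the apt-installed oneAPI environment instead
source /opt/intel/oneapi/setvars.sh
# start ollama again against the system oneAPI
./ollama serve
# if it still fails, collect detailed environment info to share here
bash env_check.sh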

kylinzhao90 commented 3 days ago

(screenshot attached)

These are the outputs of env_check.sh:

(llm-cpp) root@O-E-M:~# bash env_check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240624
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             22
On-line CPU(s) list:                0-21
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) Ultra 7 155H
CPU family:                         6
Model:                              170
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           4
CPU max MHz:                        4800.0000
CPU min MHz:                        400.0000
BogoMIPS:                           5990.40
-----------------------------------------------------------------
Total CPU Memory: 62.4737 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.4 LTS \n \l

-----------------------------------------------------------------
Linux O-E-M 6.5.0-18-generic #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb  7 11:40:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.35.20240425
    Build ID: 00000000

Service:
    Version: 1.2.35.20240425
    Build ID: 00000000
    Level Zero Version: 1.16.0
-----------------------------------------------------------------
env_check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii  intel-level-zero-gpu                       1.3.29138.29-881~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
env_check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-00087d558086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
        DeviceName: Onboard - Video
        Subsystem: Intel Corporation Device 2212
        Flags: bus master, fast devsel, latency 0, IRQ 175, IOMMU group 0
        Memory at 4810000000 (64-bit, prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
-----------------------------------------------------------------
(llm-cpp) root@O-E-M:~#

kylinzhao90 commented 3 days ago

I reinstalled oneAPI via apt, and it hits the same error:

(base) root@O-E-M:~# source /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

(base) root@O-E-M:~# conda activate llm-cpp
(llm-cpp) root@O-E-M:~# cd ollama/
(llm-cpp) root@O-E-M:~/ollama# ls
ollama
(llm-cpp) root@O-E-M:~/ollama# ./ollama serve
2024/06/26 11:13:12 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-26T11:13:12.390+08:00 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-06-26T11:13:12.391+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-26T11:13:12.392+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-26T11:13:12.392+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2351670096/runners
time=2024-06-26T11:13:12.457+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu]"
[GIN] 2024/06/26 - 11:13:49 | 200 |      48.476µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/26 - 11:13:49 | 200 |     596.527µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/26 - 11:13:49 | 200 |     306.471µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-26T11:13:49.344+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T11:13:49.344+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2351670096/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 38005"
time=2024-06-26T11:13:49.345+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T11:13:49.345+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T11:13:49.345+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140284799387648" timestamp=1719371629
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140284799387648" timestamp=1719371629 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="38005" tid="140284799387648" timestamp=1719371629
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 62583M|            1.3.29138|
| 1|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2228.82 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
time=2024-06-26T11:13:49.597+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.13 MiB
[1719371630] warming up the model with an empty run
llama_new_context_with_model:      SYCL0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 53767]
[New LWP 53768]
[New LWP 53769]
[New LWP 53770]
[New LWP 53771]
[New LWP 53772]
[New LWP 53773]
[New LWP 53774]
[New LWP 53775]
[New LWP 53776]
[New LWP 53777]
[New LWP 53778]
[New LWP 53779]
[New LWP 53780]
[New LWP 53781]
[New LWP 53782]
[New LWP 53783]
[New LWP 53784]
[New LWP 53785]
[New LWP 53786]
[New LWP 53787]
[New LWP 53788]
[New LWP 53789]
[New LWP 53790]
[New LWP 53791]
[New LWP 53792]
[New LWP 53793]
[New LWP 53794]
[New LWP 53795]
[New LWP 53796]
[New LWP 53797]
[New LWP 53798]
[New LWP 53799]
[New LWP 53800]
[New LWP 53801]
[New LWP 53802]
[New LWP 53803]
[New LWP 53804]
[New LWP 53805]
[New LWP 53806]
[New LWP 53807]
[New LWP 53808]
[New LWP 53809]
[New LWP 53810]
[New LWP 53811]
[New LWP 53812]
[New LWP 53813]
[New LWP 53814]
[New LWP 53815]
[New LWP 53816]
[New LWP 53817]
[New LWP 53818]
[New LWP 53819]
[New LWP 53820]
time=2024-06-26T11:13:51.054+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f968c2ea42f in __GI___wait4 (pid=53821, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f968c2ea42f in __GI___wait4 (pid=53821, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00000000006b9426 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2  0x00000000006b4e67 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3  0x000000000077006f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4  0x0000000000673a48 in ggml_backend_sched_graph_compute_async ()
#5  0x00000000005820db in llama_decode ()
#6  0x00000000005066b7 in llama_init_from_gpt_params(gpt_params&) ()
#7  0x000000000043db28 in main ()
[Inferior 1 (process 53766) detached]
time=2024-06-26T11:13:52.214+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
time=2024-06-26T11:13:52.488+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-06-26T11:13:52.739+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/06/26 - 11:13:52 | 500 |  3.522210885s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/06/26 - 11:15:27 | 200 |      31.206µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/26 - 11:15:27 | 200 |     417.158µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/26 - 11:15:27 | 200 |     375.924µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-26T11:15:27.897+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T11:15:27.897+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2351670096/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 34589"
time=2024-06-26T11:15:27.898+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T11:15:27.898+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T11:15:27.898+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140477976528896" timestamp=1719371727
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140477976528896" timestamp=1719371727 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="34589" tid="140477976528896" timestamp=1719371727
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 62583M|            1.3.29138|
| 1|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2228.82 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
time=2024-06-26T11:15:28.150+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.13 MiB
[1719371729] warming up the model with an empty run
llama_new_context_with_model:      SYCL0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

rnwang04 commented 3 days ago

Hi @kylinzhao90, does your llm-cpp env have any onednn / onemkl related packages? If so, don't conda activate llm-cpp; stay in your base conda env (which doesn't have the pip-installed oneAPI). Try conda deactivate and then run ./ollama serve.
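As a rough sketch (assuming the llm-cpp env name and the apt oneAPI setup from above; this is a suggestion to try, not a confirmed fix):

# check whether pip-installed oneAPI components exist inside llm-cpp
conda activate llm-cpp
pip list | grep -Ei 'onednn|onemkl'
# if they show up, go back to the base env and rely on the apt oneAPI instead
conda deactivate
source /opt/intel/oneapi/setvars.sh
./ollama serve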

kylinzhao90 commented 3 days ago

llm-cpp does have onednn / onemkl related packages.

I have removed the LD_LIBRARY_PATH config from the env:

(base) root@O-E-M:~/ollama# conda env config vars list -n llm-cpp
(base) root@O-E-M:~/ollama#
(llm-cpp) root@O-E-M:~/ollama# pip list |grep onednn
onednn                   2024.0.0
(llm-cpp) root@O-E-M:~/ollama# pip list |grep onemkl
onemkl-sycl-blas         2024.0.0
onemkl-sycl-datafitting  2024.0.0
onemkl-sycl-dft          2024.0.0
onemkl-sycl-lapack       2024.0.0
onemkl-sycl-rng          2024.0.0
onemkl-sycl-sparse       2024.0.0
onemkl-sycl-stats        2024.0.0
onemkl-sycl-vm           2024.0.0
(llm-cpp) root@O-E-M:~/ollama#

and I still hit the same issue:

(base) root@O-E-M:~# source /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

(base) root@O-E-M:~# cd ollama/
(base) root@O-E-M:~/ollama# ./ollama serve
2024/06/26 12:34:06 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-26T12:34:06.129+08:00 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-06-26T12:34:06.129+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-26T12:34:06.130+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-26T12:34:06.130+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3082295135/runners
time=2024-06-26T12:34:06.191+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
[GIN] 2024/06/26 - 12:34:09 | 200 |      52.664µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/26 - 12:34:09 | 200 |     594.907µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/26 - 12:34:09 | 200 |     317.584µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-26T12:34:09.320+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T12:34:09.320+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama3082295135/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 43809"
time=2024-06-26T12:34:09.321+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T12:34:09.321+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T12:34:09.321+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140337736312832" timestamp=1719376449
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140337736312832" timestamp=1719376449 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="43809" tid="140337736312832" timestamp=1719376449
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 62583M|            1.3.29138|
| 1|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2228.82 MiB
llm_load_tensors:        CPU buffer size =    52.84 MiB
time=2024-06-26T12:34:09.573+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.13 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   168.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
[1719376450] warming up the model with an empty run
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 56060]
[New LWP 56061]
[New LWP 56062]
[New LWP 56063]
[New LWP 56064]
[New LWP 56065]
[New LWP 56066]
[New LWP 56067]
[New LWP 56068]
[New LWP 56069]
[New LWP 56070]
[New LWP 56071]
[New LWP 56072]
[New LWP 56073]
[New LWP 56074]
[New LWP 56075]
[New LWP 56076]
[New LWP 56077]
[New LWP 56078]
[New LWP 56079]
[New LWP 56080]
[New LWP 56081]
[New LWP 56082]
[New LWP 56083]
[New LWP 56084]
[New LWP 56085]
[New LWP 56086]
[New LWP 56087]
[New LWP 56088]
[New LWP 56089]
[New LWP 56090]
[New LWP 56091]
[New LWP 56092]
[New LWP 56093]
[New LWP 56094]
[New LWP 56095]
[New LWP 56096]
[New LWP 56097]
[New LWP 56098]
[New LWP 56099]
[New LWP 56100]
[New LWP 56101]
[New LWP 56102]
[New LWP 56103]
[New LWP 56104]
[New LWP 56105]
[New LWP 56106]
[New LWP 56107]
[New LWP 56108]
[New LWP 56109]
[New LWP 56110]
[New LWP 56111]
[New LWP 56112]
[New LWP 56113]
time=2024-06-26T12:34:11.028+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa2df6ea42f in __GI___wait4 (pid=56114, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007fa2df6ea42f in __GI___wait4 (pid=56114, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00000000006b9426 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2  0x00000000006b4e67 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3  0x000000000077006f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4  0x0000000000673a48 in ggml_backend_sched_graph_compute_async ()
#5  0x00000000005820db in llama_decode ()
#6  0x00000000005066b7 in llama_init_from_gpt_params(gpt_params&) ()
#7  0x000000000043db28 in main ()
[Inferior 1 (process 56059) detached]
time=2024-06-26T12:34:12.182+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
time=2024-06-26T12:34:12.433+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/06/26 - 12:34:12 | 500 |  3.236072737s |       127.0.0.1 | POST     "/api/chat"

kylinzhao90 commented 2 days ago

@rnwang04 any comments on this issue?

rnwang04 commented 2 days ago

Sadly, we can't reproduce this issue on our Linux MTL machine (Intel(R) Core(TM) Ultra 5 125H); see the attached screenshots.

Regarding this error, the only related issue we have encountered is https://github.com/intel-analytics/ipex-llm/issues/10845, but it seems that doesn't work in your case.

Here is our env info, pasted here for your reference:

-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240626
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             18
On-line CPU(s) list:                0-17
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) Ultra 5 125H
CPU family:                         6
Model:                              170
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          1
Stepping:                           4
CPU max MHz:                        4500.0000
CPU min MHz:                        400.0000
BogoMIPS:                           5990.40
-----------------------------------------------------------------
Total CPU Memory: 30.9502 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 22.04.3 LTS \n \l

-----------------------------------------------------------------
Linux xiaoxin04-ubuntu 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.22.20231126
    Build ID: 00000000

Service:
    Version: 1.2.22.20231126
    Build ID: 00000000
    Level Zero Version: 1.14.0
-----------------------------------------------------------------
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2024.17.3.0.08_160000
  Driver UUID                                     32342e30-392e-3238-3731-372e31320000
  Driver Version                                  24.09.28717.12
  Driver Version                                  2024.17.3.0.08_160000
-----------------------------------------------------------------
Driver related package version:
ii  intel-fw-gpu                                   2023.39.2-255~22.04                     all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu                           1.3.28717.12                            amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                                 1.14.0-744~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu detected
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-00087d558086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=256M
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 3cc9
        Flags: bus master, fast devsel, latency 0, IRQ 184, IOMMU group 0
        Memory at 408c000000 (64-bit, prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
-----------------------------------------------------------------