[Open] kylinzhao90 opened this issue 4 days ago
Hi @kylinzhao90, for Linux users we don't recommend using the pip-installed oneAPI.
Maybe you can try again with conda deactivate followed by source /opt/intel/oneapi/setvars.sh (if you have followed our guide to install oneAPI via apt: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-oneapi), for example as sketched below.
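A minimal sketch, assuming oneAPI was installed via apt to the default /opt/intel/oneapi location:

conda deactivate                        # leave the env that carries the pip-installed oneAPI packages
source /opt/intel/oneapi/setvars.sh     # pick up the apt-installed oneAPI runtime instead

and then re-run your workload from that same shell.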
If the above doesn't solve your issue, could you please share your detailed env info using the env check script?
These are the outputs of env_check.sh:
(llm-cpp) root@O-E-M:~# bash env_check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240624
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 22
On-line CPU(s) list: 0-21
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 7 155H
CPU family: 6
Model: 170
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 4
CPU max MHz: 4800.0000
CPU min MHz: 400.0000
BogoMIPS: 5990.40
-----------------------------------------------------------------
Total CPU Memory: 62.4737 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.4 LTS \n \l
-----------------------------------------------------------------
Linux O-E-M 6.5.0-18-generic #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 11:40:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.35.20240425
Build ID: 00000000
Service:
Version: 1.2.35.20240425
Build ID: 00000000
Level Zero Version: 1.16.0
-----------------------------------------------------------------
env_check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu 1.3.29138.29-881~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
env_check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Arc(TM) Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0200-0000-00087d558086 |
| | PCI BDF Address: 0000:00:02.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
DeviceName: Onboard - Video
Subsystem: Intel Corporation Device 2212
Flags: bus master, fast devsel, latency 0, IRQ 175, IOMMU group 0
Memory at 4810000000 (64-bit, prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
-----------------------------------------------------------------
(llm-cpp) root@O-E-M:~#
I reinstalled oneAPI via apt, and it hits the same error:
(base) root@O-E-M:~# source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
-bash: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
(base) root@O-E-M:~# conda activate llm-cpp
(llm-cpp) root@O-E-M:~# cd ollama/
(llm-cpp) root@O-E-M:~/ollama# ls
ollama
(llm-cpp) root@O-E-M:~/ollama# ./ollama serve
2024/06/26 11:13:12 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-26T11:13:12.390+08:00 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-06-26T11:13:12.391+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-26T11:13:12.392+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-26T11:13:12.392+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2351670096/runners
time=2024-06-26T11:13:12.457+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu]"
[GIN] 2024/06/26 - 11:13:49 | 200 | 48.476µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/06/26 - 11:13:49 | 200 | 596.527µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/06/26 - 11:13:49 | 200 | 306.471µs | 127.0.0.1 | POST "/api/show"
time=2024-06-26T11:13:49.344+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T11:13:49.344+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2351670096/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 38005"
time=2024-06-26T11:13:49.345+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T11:13:49.345+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T11:13:49.345+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140284799387648" timestamp=1719371629
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140284799387648" timestamp=1719371629 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="38005" tid="140284799387648" timestamp=1719371629
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.name str = Phi3
llama_model_loader: - kv 2: phi3.context_length u32 = 4096
llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072
llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192
llama_model_loader: - kv 6: phi3.block_count u32 = 32
llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32
llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96
llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 81 tensors
llama_model_loader: - type q5_K: 32 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name = Phi3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 128| 1024| 32| 62583M| 1.3.29138|
| 1| [opencl:cpu:0]| Intel Core Ultra 7 155H| 3.0| 22| 8192| 64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 22|67108864| 64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 2228.82 MiB
llm_load_tensors: CPU buffer size = 52.84 MiB
time=2024-06-26T11:13:49.597+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.13 MiB
[1719371630] warming up the model with an empty run
llama_new_context_with_model: SYCL0 compute buffer size = 168.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 53767]
[New LWP 53768]
[New LWP 53769]
[New LWP 53770]
[New LWP 53771]
[New LWP 53772]
[New LWP 53773]
[New LWP 53774]
[New LWP 53775]
[New LWP 53776]
[New LWP 53777]
[New LWP 53778]
[New LWP 53779]
[New LWP 53780]
[New LWP 53781]
[New LWP 53782]
[New LWP 53783]
[New LWP 53784]
[New LWP 53785]
[New LWP 53786]
[New LWP 53787]
[New LWP 53788]
[New LWP 53789]
[New LWP 53790]
[New LWP 53791]
[New LWP 53792]
[New LWP 53793]
[New LWP 53794]
[New LWP 53795]
[New LWP 53796]
[New LWP 53797]
[New LWP 53798]
[New LWP 53799]
[New LWP 53800]
[New LWP 53801]
[New LWP 53802]
[New LWP 53803]
[New LWP 53804]
[New LWP 53805]
[New LWP 53806]
[New LWP 53807]
[New LWP 53808]
[New LWP 53809]
[New LWP 53810]
[New LWP 53811]
[New LWP 53812]
[New LWP 53813]
[New LWP 53814]
[New LWP 53815]
[New LWP 53816]
[New LWP 53817]
[New LWP 53818]
[New LWP 53819]
[New LWP 53820]
time=2024-06-26T11:13:51.054+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
To enable execution of this file add
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f968c2ea42f in __GI___wait4 (pid=53821, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f968c2ea42f in __GI___wait4 (pid=53821, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00000000006b9426 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2 0x00000000006b4e67 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3 0x000000000077006f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4 0x0000000000673a48 in ggml_backend_sched_graph_compute_async ()
#5 0x00000000005820db in llama_decode ()
#6 0x00000000005066b7 in llama_init_from_gpt_params(gpt_params&) ()
#7 0x000000000043db28 in main ()
[Inferior 1 (process 53766) detached]
time=2024-06-26T11:13:52.214+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
time=2024-06-26T11:13:52.488+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-06-26T11:13:52.739+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/06/26 - 11:13:52 | 500 | 3.522210885s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/26 - 11:15:27 | 200 | 31.206µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/06/26 - 11:15:27 | 200 | 417.158µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/06/26 - 11:15:27 | 200 | 375.924µs | 127.0.0.1 | POST "/api/show"
time=2024-06-26T11:15:27.897+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T11:15:27.897+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2351670096/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 34589"
time=2024-06-26T11:15:27.898+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T11:15:27.898+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T11:15:27.898+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140477976528896" timestamp=1719371727
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140477976528896" timestamp=1719371727 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="34589" tid="140477976528896" timestamp=1719371727
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.name str = Phi3
llama_model_loader: - kv 2: phi3.context_length u32 = 4096
llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072
llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192
llama_model_loader: - kv 6: phi3.block_count u32 = 32
llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32
llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96
llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 81 tensors
llama_model_loader: - type q5_K: 32 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name = Phi3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 128| 1024| 32| 62583M| 1.3.29138|
| 1| [opencl:cpu:0]| Intel Core Ultra 7 155H| 3.0| 22| 8192| 64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 22|67108864| 64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 2228.82 MiB
llm_load_tensors: CPU buffer size = 52.84 MiB
time=2024-06-26T11:15:28.150+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.13 MiB
[1719371729] warming up the model with an empty run
llama_new_context_with_model: SYCL0 compute buffer size = 168.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
Hi @kylinzhao90, does your llm-cpp env have onednn / onemkl related packages?
If so, don't conda activate llm-cpp; stay in your base conda env (which doesn't have the pip-installed oneAPI). Try conda deactivate and then run ./ollama serve, as sketched below.
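A minimal sketch of that sequence, assuming the layout from your logs (ollama binary under ~/ollama and oneAPI installed via apt):

conda deactivate                        # back to the base env, which has no pip-installed oneDNN/oneMKL packages
source /opt/intel/oneapi/setvars.sh     # keep the apt-installed oneAPI environment
cd ~/ollama
./ollama serve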
llm-cpp has onednn / onemkl related packages.
I have also removed the LD_LIBRARY_PATH config from the env:
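For reference, assuming the variable had been set with conda env config vars set, removing it would be roughly:

conda env config vars unset LD_LIBRARY_PATH -n llm-cpp   # drop the per-env variable

The empty listing below confirms nothing is left: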
(base) root@O-E-M:~/ollama# conda env config vars list -n llm-cpp
(base) root@O-E-M:~/ollama#
(llm-cpp) root@O-E-M:~/ollama# pip list |grep onednn
onednn 2024.0.0
(llm-cpp) root@O-E-M:~/ollama# pip list |grep onemkl
onemkl-sycl-blas 2024.0.0
onemkl-sycl-datafitting 2024.0.0
onemkl-sycl-dft 2024.0.0
onemkl-sycl-lapack 2024.0.0
onemkl-sycl-rng 2024.0.0
onemkl-sycl-sparse 2024.0.0
onemkl-sycl-stats 2024.0.0
onemkl-sycl-vm 2024.0.0
(llm-cpp) root@O-E-M:~/ollama#
and I hit the same issue:
(base) root@O-E-M:~# source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
-bash: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
(base) root@O-E-M:~# cd ollama/
(base) root@O-E-M:~/ollama# ./ollama serve
2024/06/26 12:34:06 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-26T12:34:06.129+08:00 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-06-26T12:34:06.129+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-26T12:34:06.130+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-26T12:34:06.130+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3082295135/runners
time=2024-06-26T12:34:06.191+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
[GIN] 2024/06/26 - 12:34:09 | 200 | 52.664µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/06/26 - 12:34:09 | 200 | 594.907µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/06/26 - 12:34:09 | 200 | 317.584µs | 127.0.0.1 | POST "/api/show"
time=2024-06-26T12:34:09.320+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.9 GiB" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="768.0 MiB" memory.weights.total="2.2 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2024-06-26T12:34:09.320+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama3082295135/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 43809"
time=2024-06-26T12:34:09.321+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-26T12:34:09.321+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-26T12:34:09.321+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="adbd0dc" tid="140337736312832" timestamp=1719376449
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140337736312832" timestamp=1719376449 total_threads=22
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="21" port="43809" tid="140337736312832" timestamp=1719376449
llama_model_loader: loaded meta data with 26 key-value pairs and 195 tensors from /root/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.name str = Phi3
llama_model_loader: - kv 2: phi3.context_length u32 = 4096
llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072
llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192
llama_model_loader: - kv 6: phi3.block_count u32 = 32
llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32
llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96
llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 81 tensors
llama_model_loader: - type q5_K: 32 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 2.23 GiB (5.01 BPW)
llm_load_print_meta: general.name = Phi3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 128| 1024| 32| 62583M| 1.3.29138|
| 1| [opencl:cpu:0]| Intel Core Ultra 7 155H| 3.0| 22| 8192| 64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
| 2| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 22|67108864| 64| 67080M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 2228.82 MiB
llm_load_tensors: CPU buffer size = 52.84 MiB
time=2024-06-26T12:34:09.573+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.13 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 168.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
[1719376450] warming up the model with an empty run
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 56060]
[New LWP 56061]
[New LWP 56062]
[New LWP 56063]
[New LWP 56064]
[New LWP 56065]
[New LWP 56066]
[New LWP 56067]
[New LWP 56068]
[New LWP 56069]
[New LWP 56070]
[New LWP 56071]
[New LWP 56072]
[New LWP 56073]
[New LWP 56074]
[New LWP 56075]
[New LWP 56076]
[New LWP 56077]
[New LWP 56078]
[New LWP 56079]
[New LWP 56080]
[New LWP 56081]
[New LWP 56082]
[New LWP 56083]
[New LWP 56084]
[New LWP 56085]
[New LWP 56086]
[New LWP 56087]
[New LWP 56088]
[New LWP 56089]
[New LWP 56090]
[New LWP 56091]
[New LWP 56092]
[New LWP 56093]
[New LWP 56094]
[New LWP 56095]
[New LWP 56096]
[New LWP 56097]
[New LWP 56098]
[New LWP 56099]
[New LWP 56100]
[New LWP 56101]
[New LWP 56102]
[New LWP 56103]
[New LWP 56104]
[New LWP 56105]
[New LWP 56106]
[New LWP 56107]
[New LWP 56108]
[New LWP 56109]
[New LWP 56110]
[New LWP 56111]
[New LWP 56112]
[New LWP 56113]
time=2024-06-26T12:34:11.028+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
To enable execution of this file add
add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa2df6ea42f in __GI___wait4 (pid=56114, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007fa2df6ea42f in __GI___wait4 (pid=56114, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00000000006b9426 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2 0x00000000006b4e67 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3 0x000000000077006f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4 0x0000000000673a48 in ggml_backend_sched_graph_compute_async ()
#5 0x00000000005820db in llama_decode ()
#6 0x00000000005066b7 in llama_init_from_gpt_params(gpt_params&) ()
#7 0x000000000043db28 in main ()
[Inferior 1 (process 56059) detached]
time=2024-06-26T12:34:12.182+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
time=2024-06-26T12:34:12.433+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15299\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/06/26 - 12:34:12 | 500 | 3.236072737s | 127.0.0.1 | POST "/api/chat"
@rnwang04, any comments on this issue?
Sadly, we can't reproduce this issue on our Linux MTL machine (Intel(R) Core(TM) Ultra 5 125H).
Regarding this error, the only related issue we have encountered is https://github.com/intel-analytics/ipex-llm/issues/10845, but it seems that doesn't work for you.
Here is our env info, pasted here for your reference:
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240626
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 18
On-line CPU(s) list: 0-17
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 5 125H
CPU family: 6
Model: 170
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 4
CPU max MHz: 4500.0000
CPU min MHz: 400.0000
BogoMIPS: 5990.40
-----------------------------------------------------------------
Total CPU Memory: 30.9502 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.3 LTS \n \l
-----------------------------------------------------------------
Linux xiaoxin04-ubuntu 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.22.20231126
Build ID: 00000000
Service:
Version: 1.2.22.20231126
Build ID: 00000000
Level Zero Version: 1.14.0
-----------------------------------------------------------------
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2024.17.3.0.08_160000
Driver UUID 32342e30-392e-3238-3731-372e31320000
Driver Version 24.09.28717.12
Driver Version 2024.17.3.0.08_160000
-----------------------------------------------------------------
Driver related package version:
ii intel-fw-gpu 2023.39.2-255~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-level-zero-gpu 1.3.28717.12 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-dev 1.14.0-744~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu detected
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Arc(TM) Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0200-0000-00087d558086 |
| | PCI BDF Address: 0000:00:02.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=256M
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device 3cc9
Flags: bus master, fast devsel, latency 0, IRQ 184, IOMMU group 0
Memory at 408c000000 (64-bit, prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
-----------------------------------------------------------------
I meet this issue while using ollama on an MTL iGPU. My IPEX-LLM version is as below; iGPU info is as below.