intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

ollama loading model gemma2 error: llama runner process has terminated: exit status 0xc0000409 #11723

Closed JerryXu2023 closed 2 months ago

JerryXu2023 commented 3 months ago

After running ollama serve, there was an error when loading the gemma2 model. Strangely, other models such as qwen2 and llama3 load without any issues. I have updated ipex-llm[cpp] to version 2.1.0b20240805.

Error message:

    GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/llama.cpp:10739: false
    time=2024-08-06T16:07:03.878+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
    time=2024-08-06T16:07:04.591+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
    time=2024-08-06T16:07:04.851+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
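For reference, the rough sequence I ran before hitting the error (a sketch of my setup; the conda environment name, the version pin, and the init script name are from memory and may differ slightly from the ipex-llm quickstart):

    conda activate intel_gpu
    pip install --pre --upgrade ipex-llm[cpp]==2.1.0b20240805
    init-ollama.bat
    ollama serve
    ollama run gemma2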

rnwang04 commented 3 months ago

Hi @JerryXu2023 , could you please run the scripts at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and reply with the output? Could you also provide us with more detailed error logs from both the server side and the client side?
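On Windows, running the check looks roughly like the following (assuming the script in that folder is named env-check.bat; please adjust if the name differs):

    git clone https://github.com/intel-analytics/ipex-llm.git
    cd ipex-llm\python\llm\scripts
    env-check.bat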

JerryXu2023 commented 3 months ago

Hi @rnwang04 Thanks for your reply.

I ran the check program.

Below is the check information:

Python 3.11.9

transformers=4.43.1

torch=2.2.0+cpu

Name: ipex-llm
Version: 2.1.0b20240805
Summary: Large Language Model Develop Toolkit
Home-page: https://github.com/intel-analytics/ipex-llm
Author: BigDL Authors
Author-email: bigdl-user-group@googlegroups.com
License: Apache License, Version 2.0
Location: d:\Anaconda3\envs\intel_gpu\Lib\site-packages
Requires:
Required-by:

IPEX is not installed properly.

Total Memory: 15.745 GB

Chip 0 Memory: 8 GB | Speed: 3200 MHz
Chip 1 Memory: 8 GB | Speed: 3200 MHz

CPU Manufacturer: GenuineIntel
CPU MaxClockSpeed: 2496
CPU Name: 11th Gen Intel(R) Core(TM) i5-11320H @ 3.20GHz
CPU NumberOfCores: 4
CPU NumberOfLogicalProcessors: 8

GPU 0: Intel(R) Graphics Control Panel    Driver Version: 32.0.101.5762
GPU 1: Intel(R) Iris(R) Xe Graphics       Driver Version: 32.0.101.5762


System Information

Host Name: XXX
OS Name: Microsoft Windows 11 Pro
OS Version: 10.0.22631 N/A Build 22631
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Registered Owner:
Registered Organization:
Product ID: XXXXX
Original Install Date: 2023/2/8, 8:52:46
System Boot Time: 2024/8/5, 10:04:59
System Manufacturer: Dell Inc.
System Model: Vostro 14 5410
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
              [01]: Intel64 Family 6 Model 140 Stepping 2 GenuineIntel ~2496 Mhz
BIOS Version: Dell Inc. 2.14.0, 2022/9/14
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn; Chinese (China)
Input Locale: zh-cn; Chinese (China)
Time Zone: (UTC+08:00) Beijing, Chongqing, Hong Kong SAR, Urumqi
Total Physical Memory: 16,123 MB
Available Physical Memory: 5,963 MB
Virtual Memory: Max Size: 40,699 MB
Virtual Memory: Available: 18,559 MB
Virtual Memory: In Use: 22,140 MB
Page File Location(s): C:\pagefile.sys
Domain: XXXXXXX
Logon Server: \XXXXXXXXX
Hotfix(s): 5 Hotfix(s) Installed.

              [02]: KB5012170
              [03]: KB5027397
              [04]: KB5040442
              [05]: KB5039338

Network Card(s): 6 NIC(s) Installed.
              [01]: Fortinet Virtual Ethernet Adapter (NDIS 6.30)
                    Connection Name: Ethernet 2
                    Status: Media disconnected
              [02]: Fortinet SSL VPN Virtual Ethernet Adapter
                    Connection Name: Ethernet 3
                    Status: Hardware not present
              [03]: Realtek USB GbE Family Controller
                    Connection Name: Ethernet 4
                    Status: Media disconnected
              [04]: Intel(R) Wi-Fi 6 AX201 160MHz
                    Connection Name: WLAN
                    Status: Media disconnected
              [05]: Bluetooth Device (Personal Area Network)
                    Connection Name: Bluetooth Network Connection
                    Status: Hardware not present
              [06]: Realtek PCIe GbE Family Controller
                    Connection Name: Ethernet
                    DHCP Enabled: Yes
                    DHCP Server: XXXXXXX
                    IP address(es)

                    [02]: XXXXXXX

Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.

'xpu-smi' is not recognized as an internal or external command, operable program or batch file.
xpu-smi is not installed properly.

rnwang04 commented 3 months ago

Hi @JerryXu2023 , got it, thanks for the quick reply. Could you please also provide the exact command you ran, the ollama server log, and the ollama client log? Then we will try to reproduce this issue on our Iris iGPU : )
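For example, on Windows you could capture both sides roughly like this (the log file names are just placeholders, and the piped prompt is only to get a non-interactive run; --verbose adds client-side timing):

    set OLLAMA_DEBUG=1
    ollama serve > server.log 2>&1

    REM in a second terminal:
    echo hi | ollama run gemma2 --verbose > client.log 2>&1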

JerryXu2023 commented 3 months ago

Hi @rnwang04 Below is the ollama serve log:

2024/08/07 13:55:22 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-07T13:55:22.755+08:00 level=INFO source=images.go:729 msg="total blobs: 25"
time=2024-08-07T13:55:22.766+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-07T13:55:22.777+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-07T13:55:22.777+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"

And below is the log from ollama run gemma:

2024/08/07 13:55:22 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-07T13:55:22.755+08:00 level=INFO source=images.go:729 msg="total blobs: 25"
time=2024-08-07T13:55:22.766+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-07T13:55:22.777+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-07T13:55:22.777+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
[GIN] 2024/08/07 - 14:01:17 | 200 | 0s | 127.0.0.1 | HEAD "/"
time=2024-08-07T14:01:17.457+08:00 level=WARN source=routes.go:757 msg="bad manifest config filepath" name=registry.ollama.ai/library/Unichat-llama3-Chinese-8B:latest error="open D:\software\ollama_models\blobs\sha256-99d9b27ff44d023077be1be3728f1eb8b668bc5a9eef324346428e8e5f0150a5: The system cannot find the file specified."
[GIN] 2024/08/07 - 14:01:17 | 200 | 85.7745ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/08/07 - 14:01:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/07 - 14:01:27 | 200 | 11.1779ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/08/07 - 14:01:27 | 200 | 3.1559ms | 127.0.0.1 | POST "/api/show"
time=2024-08-07T14:01:34.411+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=27 memory.available="5.0 GiB" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="234.0 MiB" memory.weights.total="1.5 GiB" memory.weights.repeating="1.1 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="78.0 MiB" memory.graph.partial="78.0 MiB"
time=2024-08-07T14:01:34.443+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners\cpu_avx2\ollama_llama_server.exe --model D:\software\ollama_models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 63315"
time=2024-08-07T14:01:34.647+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-08-07T14:01:34.658+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-08-07T14:01:34.666+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info build=1 commit="b791c1a" tid="18300" timestamp=1723010495
INFO [wmain] system info n_threads=4 n_threads_batch=-1 system_info="AVX = 1 AVX_VNNI = 0 AVX2 = 1 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 AVX512_BF16 = 0 FMA = 1 NEON = 0 SVE = 0 ARM_FMA = 0 F16C = 1 FP16_VA = 0 WASM_SIMD = 0 BLAS = 1 SSE3 = 1 SSSE3 = 1 VSX = 0 MATMUL_INT8 = 0 LLAMAFILE = 1 " tid="18300" timestamp=1723010495 total_threads=8
INFO [wmain] HTTP server listening hostname="127.0.0.1" n_threads_http="7" port="63315" tid="18300" timestamp=1723010495
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from D:\software\ollama_models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers
llama_model_loader: - kv 3: general.finetune str = it-transformers
llama_model_loader: - kv 4: general.basename str = gemma-2.0
llama_model_loader: - kv 5: general.size_label str = 2B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: gemma2.context_length u32 = 8192
llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304
llama_model_loader: - kv 9: gemma2.block_count u32 = 26
llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216
llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8
llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv 16: general.file_type u32 = 2
llama_model_loader: - kv 17: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
time=2024-08-07T14:01:35.704+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 31: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 105 tensors
llama_model_loader: - type q4_0: 182 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 418/256000 vs 505/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 26
llm_load_print_meta: n_rot = 288
llm_load_print_meta: n_swa = 4096
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 9216
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 2.61 B
llm_load_print_meta: model size = 1.51 GiB (4.97 BPW)
llm_load_print_meta: general.name = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token = 2 ''
llm_load_print_meta: EOS token = 1 ''
llm_load_print_meta: UNK token = 3 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_print_meta: EOT token = 107 ''
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version
0 [level_zero:gpu:0] Intel Iris Xe Graphics 1.3 96 512 32 7473M 1.3.29803
1 [opencl:gpu:0] Intel Iris Xe Graphics 3.0 96 512 32 7473M 32.0.101.5762
2 [opencl:cpu:0] 11th Gen Intel Core i5-11320H @ 3.20GHz 3.0 8 8192 64 16905M 2024.18.6.0.02_160000

ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:96
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: SYCL0 buffer size = 1548.29 MiB
llm_load_tensors: CPU buffer size = 461.43 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 208.00 MiB
llama_new_context_with_model: KV self size = 208.00 MiB, K (f16): 104.00 MiB, V (f16): 104.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.99 MiB
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/llama.cpp:10739: false
time=2024-08-07T14:01:46.178+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-07T14:01:46.571+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-08-07T14:01:47.094+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/08/07 - 14:01:47 | 500 | 19.366325s | 127.0.0.1 | POST "/api/chat"
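One thing I noticed in the log above is the warning about ext_intel_free_memory and ZES_ENABLE_SYSMAN=1; I have not set that variable. If it matters, I can retry with something like the following before starting the server (this is just a guess based on the warning text, not a fix I have confirmed):

    set ZES_ENABLE_SYSMAN=1
    ollama serve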

rnwang04 commented 3 months ago

Hi @JerryXu2023 , I have reproduced your error. Actually, we previously only added support for gemma2-9b, and gemma2-2b is not supported yet. I will try to add support for it, and once it's done I will update here to let you know.

JerryXu2023 commented 3 months ago

Hi @rnwang04 Noted, thanks for your support!

rnwang04 commented 3 months ago

Support for gemma2-2b has been added. You can try it again with ipex-llm[cpp]>=2.1.0b20240807 tomorrow 😊
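To upgrade, something like this should work (quoting so cmd does not treat >= as a redirection):

    pip install --pre --upgrade "ipex-llm[cpp]>=2.1.0b20240807"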

JerryXu2023 commented 3 months ago

I will try it and report the result tomorrow. Thanks again.

JerryXu2023 commented 3 months ago

Hi @rnwang04 There is no issue now running ollama run Gemma2:2b on version 2.1.0b20240807. However, I found that after the model loads, when I ask questions the model does not respond. I'm not sure if it's an issue with my personal computer. Could you try to reproduce the issue? Thanks

rnwang04 commented 3 months ago

Hi @JerryXu2023 , I have reproduced this issue on my side. It's a little strange, but I feel it's an issue with the model itself. I have tried this GGUF model (https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/blob/main/gemma-2-2b-it-Q4_K_S.gguf) with ipex-llm's llama.cpp, and it works fine. I also tried using this GGUF in ollama; it produces output as well, although the quality is not very good and it may need some prompting.
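In case it helps, the way I loaded that GGUF into ollama was roughly as follows (the model name gemma2-2b-q4ks is arbitrary; importing a local GGUF via a Modelfile FROM line is standard ollama usage):

    echo FROM ./gemma-2-2b-it-Q4_K_S.gguf > Modelfile
    ollama create gemma2-2b-q4ks -f Modelfile
    ollama run gemma2-2b-q4ks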

rnwang04 commented 2 months ago

Hi @JerryXu2023 , here is a new workaround for gemma2:2b: https://github.com/intel-analytics/ipex-llm/issues/11771#issuecomment-2285483849 Hope it helps 😊

JerryXu2023 commented 2 months ago

Yeah~~ It works fine! Thanks so much!