Hi @JerryXu2023 , could you please run https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and reply with the output? At the same time, could you please provide us with more detailed error logs from the server side and the client side?
Hi @rnwang04 Thanks for your reply.
I ran the check program.
Total Memory: 15.745 GB
System Information
Host Name: XXX
OS Name: Microsoft Windows 11 Pro
OS Version: 10.0.22631 N/A Build 22631
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Registered Owner:
Registered Organization:
Product ID: XXXXX
Original Install Date: 2023/2/8, 8:52:46
System Boot Time: 2024/8/5, 10:04:59
System Manufacturer: Dell Inc.
System Model: Vostro 14 5410
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
[01]: Intel64 Family 6 Model 140 Stepping 2 GenuineIntel ~2496 Mhz
BIOS Version: Dell Inc. 2.14.0, 2022/9/14
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn; Chinese (China)
Input Locale: zh-cn; Chinese (China)
Time Zone: (UTC+08:00) Beijing, Chongqing, Hong Kong SAR, Urumqi
Total Physical Memory: 16,123 MB
Available Physical Memory: 5,963 MB
Virtual Memory: Max Size: 40,699 MB
Virtual Memory: Available: 18,559 MB
Virtual Memory: In Use: 22,140 MB
Page File Location(s): C:\pagefile.sys
Domain: XXXXXXX
Logon Server: \XXXXXXXXX
Hotfix(s): 5 Hotfix(s) Installed.
[02]: KB5012170
[03]: KB5027397
[04]: KB5040442
[05]: KB5039338
Network Card(s): 6 NIC(s) Installed.
[01]: Fortinet Virtual Ethernet Adapter (NDIS 6.30)
     Connection Name: Ethernet 2
     Status: Media disconnected
[02]: Fortinet SSL VPN Virtual Ethernet Adapter
     Connection Name: Ethernet 3
     Status: Hardware not present
[03]: Realtek USB GbE Family Controller
     Connection Name: Ethernet 4
     Status: Media disconnected
[04]: Intel(R) Wi-Fi 6 AX201 160MHz
     Connection Name: WLAN
     Status: Media disconnected
[05]: Bluetooth Device (Personal Area Network)
     Connection Name: Bluetooth Network Connection
     Status: Hardware not present
[06]: Realtek PCIe GbE Family Controller
     Connection Name: Ethernet
     DHCP Enabled: Yes
     DHCP Server: XXXXXXX
     IP address(es)
[02]: XXXXXXX
'xpu-smi' is not recognized as an internal or external command, operable program or batch file. xpu-smi is not installed properly.
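For reference, here is a minimal Python sketch of a similar environment check. It is not the actual ipex-llm script; it assumes Windows, that `psutil` is installed, and that `xpu-smi discovery` is the device-listing subcommand:

```python
# Minimal environment-check sketch (illustrative only, not the ipex-llm script):
# reports total memory, dumps `systeminfo`, and checks whether xpu-smi is on PATH.
import shutil
import subprocess

import psutil  # assumed to be installed (pip install psutil)

# Total physical memory in GB, matching the "Total Memory" line above
total_gb = psutil.virtual_memory().total / (1024 ** 3)
print(f"Total Memory: {total_gb:.3f} GB")

# Windows system information (the block shown above)
print(subprocess.run(["systeminfo"], capture_output=True, text=True).stdout)

# xpu-smi availability; on a client iGPU setup it is typically not installed
if shutil.which("xpu-smi") is None:
    print("xpu-smi is not installed properly.")
else:
    # "discovery" is assumed to be the device-listing subcommand
    print(subprocess.run(["xpu-smi", "discovery"], capture_output=True, text=True).stdout)
```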
Hi @JerryXu2023 , got it, thanks for the quick reply. Could you please also provide your detailed cmd / ollama server log / ollama client log? Then we will try to see whether we can reproduce this issue on our Iris iGPU : )
Hi @rnwang04 Below is the ollama serve log:

2024/08/07 13:55:22 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-07T13:55:22.755+08:00 level=INFO source=images.go:729 msg="total blobs: 25"
time=2024-08-07T13:55:22.766+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-07T13:55:22.777+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-07T13:55:22.777+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
and below is the ollama run gemma log:

2024/08/07 13:55:22 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-07T13:55:22.755+08:00 level=INFO source=images.go:729 msg="total blobs: 25"
time=2024-08-07T13:55:22.766+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-07T13:55:22.777+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-07T13:55:22.777+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
[GIN] 2024/08/07 - 14:01:17 | 200 | 0s | 127.0.0.1 | HEAD "/"
time=2024-08-07T14:01:17.457+08:00 level=WARN source=routes.go:757 msg="bad manifest config filepath" name=registry.ollama.ai/library/Unichat-llama3-Chinese-8B:latest error="open D:\software\ollama_models\blobs\sha256-99d9b27ff44d023077be1be3728f1eb8b668bc5a9eef324346428e8e5f0150a5: The system cannot find the file specified."
[GIN] 2024/08/07 - 14:01:17 | 200 | 85.7745ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/08/07 - 14:01:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/07 - 14:01:27 | 200 | 11.1779ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/08/07 - 14:01:27 | 200 | 3.1559ms | 127.0.0.1 | POST "/api/show"
time=2024-08-07T14:01:34.411+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=27 memory.available="5.0 GiB" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="234.0 MiB" memory.weights.total="1.5 GiB" memory.weights.repeating="1.1 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="78.0 MiB" memory.graph.partial="78.0 MiB"
time=2024-08-07T14:01:34.443+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners\cpu_avx2\ollama_llama_server.exe --model D:\software\ollama_models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 63315"
time=2024-08-07T14:01:34.647+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-08-07T14:01:34.658+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-08-07T14:01:34.666+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=1 commit="b791c1a" tid="18300" timestamp=1723010495
INFO [wmain] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="18300" timestamp=1723010495 total_threads=8
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="63315" tid="18300" timestamp=1723010495
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from D:\software\ollama_models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers
llama_model_loader: - kv 3: general.finetune str = it-transformers
llama_model_loader: - kv 4: general.basename str = gemma-2.0
llama_model_loader: - kv 5: general.size_label str = 2B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: gemma2.context_length u32 = 8192
llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304
llama_model_loader: - kv 9: gemma2.block_count u32 = 26
llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216
llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8
llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv 16: general.file_type u32 = 2
llama_model_loader: - kv 17: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
time=2024-08-07T14:01:35.704+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["

| ID | Device Type | Name | Version | Max compute units | Max work group | Sub group size | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Iris Xe Graphics | 1.3 | 96 | 512 | 32 | 7473M | 1.3.29803 |
| 1 | [opencl:gpu:0] | Intel Iris Xe Graphics | 3.0 | 96 | 512 | 32 | 7473M | 32.0.101.5762 |
| 2 | [opencl:cpu:0] | 11th Gen Intel Core i5-11320H @ 3.20GHz | 3.0 | 8 | 8192 | 64 | 16905M | 2024.18.6.0.02_160000 |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:96
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: SYCL0 buffer size = 1548.29 MiB
llm_load_tensors: CPU buffer size = 461.43 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 208.00 MiB
llama_new_context_with_model: KV self size = 208.00 MiB, K (f16): 104.00 MiB, V (f16): 104.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.99 MiB
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/llama.cpp:10739: false
time=2024-08-07T14:01:46.178+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-07T14:01:46.571+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-08-07T14:01:47.094+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/08/07 - 14:01:47 | 500 | 19.366325s | 127.0.0.1 | POST "/api/chat"
Hi @JerryXu2023 , I have reproduced your error. Actually, we only added support for gemma2-9b before; gemma2-2b is not supported yet. I will try to add support for it and, once it's done, will update here to let you know.
Hi @rnwang04 Noted. Thanks for your support!
Support for gemma2-2b has been added. You can try it again with ipex-llm[cpp]>=2.1.0b20240807 tomorrow 😊
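Before retrying, it can help to confirm which build is actually installed. A minimal sketch, assuming the package is installed under the pip distribution name `ipex-llm`:

```python
# Print the installed ipex-llm version so it can be compared against 2.1.0b20240807.
# Assumes the pip distribution name is "ipex-llm"; the plain string comparison works
# here only because the nightly tags share the same fixed-width date format.
from importlib.metadata import PackageNotFoundError, version

try:
    installed = version("ipex-llm")
    print(f"ipex-llm version: {installed}")
    if installed < "2.1.0b20240807":
        print('Upgrade with: pip install --pre --upgrade "ipex-llm[cpp]"')
except PackageNotFoundError:
    print("ipex-llm is not installed in this environment.")
```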
I will try it and report the result tomorrow. Thanks again!
Hi @rnwang04 There is no issue running ollama run Gemma2:2b on version 2.1.0b20240807. However, I found that after entering the model's interactive prompt, the model does not respond when I ask questions. I'm not sure if it's an issue with my personal computer. Could you try to reproduce the issue? Thanks
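One way to rule out the interactive client when the model appears to hang is to call the server's REST API directly; the /api/generate route is visible in the server log above. A minimal sketch, assuming the default 127.0.0.1:11434 address from the log, the model tag gemma2:2b, and the `requests` package:

```python
# Send one non-streaming prompt to the ollama server and print the reply.
# Assumes the server is listening on 127.0.0.1:11434 (as in the log above)
# and that the model was pulled as "gemma2:2b".
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,  # the first request also waits for the model to load on the iGPU
)
resp.raise_for_status()
print(resp.json().get("response", "<no response field>"))
```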
Hi @JerryXu2023 , I have reproduced this issue on my side. It's a little strange, but I feel it's an issue with the model itself. I have tried this GGUF model (https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/blob/main/gemma-2-2b-it-Q4_K_S.gguf) with ipex-llm's llama.cpp, and it works fine. I also tried using this GGUF in ollama; it produces output as well, although the quality is not too good and it may need some prompting.
Hi @JerryXu2023 , here is a new workaround for gemma2:2b: https://github.com/intel-analytics/ipex-llm/issues/11771#issuecomment-2285483849 Hope it helps 😊
Yeah~~ It works fine! Thanks so much!
After running ollama serve, there was an error when loading the gemma2 model. Strangely, there was no error loading other models such as qwen2 and llama3, which didn't have any issues. I have updated ipex-llm[cpp] to version 2.1.0b20240805.
Error message:
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/llama.cpp:10739: false
time=2024-08-06T16:07:03.878+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-06T16:07:04.591+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-08-06T16:07:04.851+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "