Hi @JerryXu2023 , could you please run https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and reply with the output? At the same time, could you please provide us with more detailed error logs from the server side and the client side?
Hi @rnwang04 Thanks for your reply.
I ran the check program.
Total Memory: 15.745 GB
System Information
Host Name: XXX
OS Name: Microsoft Windows 11 Pro
OS Version: 10.0.22631 N/A Build 22631
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Registered Owner:
Registered Organization:
Product ID: XXXXX
Original Install Date: 2023/2/8, 8:52:46
System Boot Time: 2024/8/5, 10:04:59
System Manufacturer: Dell Inc.
System Model: Vostro 14 5410
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
[01]: Intel64 Family 6 Model 140 Stepping 2 GenuineIntel ~2496 Mhz
BIOS Version: Dell Inc. 2.14.0, 2022/9/14
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn; Chinese (China)
Input Locale: zh-cn; Chinese (China)
Time Zone: (UTC+08:00) Beijing, Chongqing, Hong Kong SAR, Urumqi
Total Physical Memory: 16,123 MB
Available Physical Memory: 5,963 MB
Virtual Memory: Max Size: 40,699 MB
Virtual Memory: Available: 18,559 MB
Virtual Memory: In Use: 22,140 MB
Page File Location(s): C:\pagefile.sys
Domain: XXXXXXX
Logon Server: \XXXXXXXXX
Hotfix(s): 5 Hotfix(s) Installed.
[02]: KB5012170
[03]: KB5027397
[04]: KB5040442
[05]: KB5039338
Network Card(s): 6 NIC(s) Installed.
[01]: Fortinet Virtual Ethernet Adapter (NDIS 6.30)
     Connection Name: Ethernet 2
     Status: Media disconnected
[02]: Fortinet SSL VPN Virtual Ethernet Adapter
     Connection Name: Ethernet 3
     Status: Hardware not present
[03]: Realtek USB GbE Family Controller
     Connection Name: Ethernet 4
     Status: Media disconnected
[04]: Intel(R) Wi-Fi 6 AX201 160MHz
     Connection Name: WLAN
     Status: Media disconnected
[05]: Bluetooth Device (Personal Area Network)
     Connection Name: Bluetooth Network Connection
     Status: Hardware not present
[06]: Realtek PCIe GbE Family Controller
     Connection Name: Ethernet
     DHCP Enabled: Yes
     DHCP Server: XXXXXXX
     IP address(es)
[02]: XXXXXXX
'xpu-smi' is not recognized as an internal or external command, operable program or batch file. xpu-smi is not installed properly.
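For reference, here is a minimal Python sketch of a similar environment check. It is not the actual ipex-llm script; it assumes Windows, that `psutil` is installed, and that `xpu-smi discovery` is the device-listing subcommand:

```python
# Minimal environment-check sketch (illustrative only, not the ipex-llm script):
# reports total memory, dumps `systeminfo`, and checks whether xpu-smi is on PATH.
import shutil
import subprocess

import psutil  # assumed to be installed (pip install psutil)

# Total physical memory in GB, matching the "Total Memory" line above
total_gb = psutil.virtual_memory().total / (1024 ** 3)
print(f"Total Memory: {total_gb:.3f} GB")

# Windows system information (the block shown above)
print(subprocess.run(["systeminfo"], capture_output=True, text=True).stdout)

# xpu-smi availability; on a client iGPU setup it is typically not installed
if shutil.which("xpu-smi") is None:
    print("xpu-smi is not installed properly.")
else:
    # "discovery" is assumed to be the device-listing subcommand
    print(subprocess.run(["xpu-smi", "discovery"], capture_output=True, text=True).stdout)
```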
Hi @JerryXu2023 , got it, thanks for the quick reply. Could you please also provide your detailed cmd / ollama server log / ollama client log? Then we will try to see whether we can reproduce this issue on our Iris iGPU : )
Hi @rnwang04 Below is the ollama serve log:

2024/08/07 13:55:22 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-07T13:55:22.755+08:00 level=INFO source=images.go:729 msg="total blobs: 25"
time=2024-08-07T13:55:22.766+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-07T13:55:22.777+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-07T13:55:22.777+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
and below is the ollama run gemma log:

2024/08/07 13:55:22 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-07T13:55:22.755+08:00 level=INFO source=images.go:729 msg="total blobs: 25"
time=2024-08-07T13:55:22.766+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-07T13:55:22.777+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-07T13:55:22.777+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
[GIN] 2024/08/07 - 14:01:17 | 200 | 0s | 127.0.0.1 | HEAD "/"
time=2024-08-07T14:01:17.457+08:00 level=WARN source=routes.go:757 msg="bad manifest config filepath" name=registry.ollama.ai/library/Unichat-llama3-Chinese-8B:latest error="open D:\software\ollama_models\blobs\sha256-99d9b27ff44d023077be1be3728f1eb8b668bc5a9eef324346428e8e5f0150a5: The system cannot find the file specified."
[GIN] 2024/08/07 - 14:01:17 | 200 | 85.7745ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/08/07 - 14:01:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/07 - 14:01:27 | 200 | 11.1779ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/08/07 - 14:01:27 | 200 | 3.1559ms | 127.0.0.1 | POST "/api/show"
time=2024-08-07T14:01:34.411+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=27 memory.available="5.0 GiB" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="234.0 MiB" memory.weights.total="1.5 GiB" memory.weights.repeating="1.1 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="78.0 MiB" memory.graph.partial="78.0 MiB"
time=2024-08-07T14:01:34.443+08:00 level=INFO source=server.go:342 msg="starting llama server" cmd="D:\python\ai\llama-cpp\dist\windows-amd64\ollama_runners\cpu_avx2\ollama_llama_server.exe --model D:\software\ollama_models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 63315"
time=2024-08-07T14:01:34.647+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-08-07T14:01:34.658+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-08-07T14:01:34.666+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=1 commit="b791c1a" tid="18300" timestamp=1723010495
INFO [wmain] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="18300" timestamp=1723010495 total_threads=8
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="63315" tid="18300" timestamp=1723010495
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from D:\software\ollama_models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers
llama_model_loader: - kv 3: general.finetune str = it-transformers
llama_model_loader: - kv 4: general.basename str = gemma-2.0
llama_model_loader: - kv 5: general.size_label str = 2B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: gemma2.context_length u32 = 8192
llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304
llama_model_loader: - kv 9: gemma2.block_count u32 = 26
llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216
llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8
llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv 16: general.file_type u32 = 2
llama_model_loader: - kv 17: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
time=2024-08-07T14:01:35.704+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["

| ID | Device Type | Name | Version | Max compute units | Max work group | Sub group size | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Iris Xe Graphics | 1.3 | 96 | 512 | 32 | 7473M | 1.3.29803 |
| 1 | [opencl:gpu:0] | Intel Iris Xe Graphics | 3.0 | 96 | 512 | 32 | 7473M | 32.0.101.5762 |
| 2 | [opencl:cpu:0] | 11th Gen Intel Core i5-11320H @ 3.20GHz | 3.0 | 8 | 8192 | 64 | 16905M | 2024.18.6.0.02_160000 |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:96
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: SYCL0 buffer size = 1548.29 MiB
llm_load_tensors: CPU buffer size = 461.43 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 208.00 MiB
llama_new_context_with_model: KV self size = 208.00 MiB, K (f16): 104.00 MiB, V (f16): 104.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.99 MiB
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/llama.cpp:10739: false
time=2024-08-07T14:01:46.178+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-07T14:01:46.571+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-08-07T14:01:47.094+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/08/07 - 14:01:47 | 500 | 19.366325s | 127.0.0.1 | POST "/api/chat"
Hi @JerryXu2023 , I have reproduced your error. Actually, we only added support for gemma2-9b before; gemma2-2b is not supported yet. I will try to add support for it and, once it's done, will update here to let you know.
Hi @rnwang04 Noted. Thanks for your support!
Support for gemma2-2b has been added. You can try it again with ipex-llm[cpp]>=2.1.0b20240807 tomorrow 😊
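Before retrying, it can help to confirm which build is actually installed. A minimal sketch, assuming the package is installed under the pip distribution name `ipex-llm`:

```python
# Print the installed ipex-llm version so it can be compared against 2.1.0b20240807.
# Assumes the pip distribution name is "ipex-llm"; the plain string comparison works
# here only because the nightly tags share the same fixed-width date format.
from importlib.metadata import PackageNotFoundError, version

try:
    installed = version("ipex-llm")
    print(f"ipex-llm version: {installed}")
    if installed < "2.1.0b20240807":
        print('Upgrade with: pip install --pre --upgrade "ipex-llm[cpp]"')
except PackageNotFoundError:
    print("ipex-llm is not installed in this environment.")
```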
I will try it and report the result tomorrow. Thanks again!
Hi @rnwang04 There is no issue running ollama run Gemma2:2b on version 2.1.0b20240807. However, I found that after entering the model's interactive prompt, the model does not respond when I ask questions. I'm not sure if it's an issue with my personal computer. Could you try to reproduce the issue? Thanks
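One way to rule out the interactive client when the model appears to hang is to call the server's REST API directly; the /api/generate route is visible in the server log above. A minimal sketch, assuming the default 127.0.0.1:11434 address from the log, the model tag gemma2:2b, and the `requests` package:

```python
# Send one non-streaming prompt to the ollama server and print the reply.
# Assumes the server is listening on 127.0.0.1:11434 (as in the log above)
# and that the model was pulled as "gemma2:2b".
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,  # the first request also waits for the model to load on the iGPU
)
resp.raise_for_status()
print(resp.json().get("response", "<no response field>"))
```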
Hi @JerryXu2023 , I have reproduced this issue on my side. It's a little strange, but I feel it's an issue with the model itself. I have tried this GGUF model (https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/blob/main/gemma-2-2b-it-Q4_K_S.gguf) with ipex-llm's llama.cpp, and it works fine. I also tried using this GGUF in ollama; it produces output as well, although the quality is not too good and it may need some prompting.
Hi @JerryXu2023 , here is a new workaround for gemma2:2b: https://github.com/intel-analytics/ipex-llm/issues/11771#issuecomment-2285483849 Hope it helps 😊
Yeah~~ It works fine! Thanks so much!
After running ollama serve, there was an error when loading the gemma2 model. Strangely, there was no error loading other models such as qwen2 and llama3, which didn't have any issues. I have updated ipex-llm[cpp] to version 2.1.0b20240805.
Error message:
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/llama.cpp:10739: false
time=2024-08-06T16:07:03.878+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-06T16:07:04.591+08:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-08-06T16:07:04.851+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "