intel-analytics / ipex-llm

Error: llama runner process has terminated: exit status 0xc0000409 #11708

Open · EMPERORAYUSH opened this issue 1 month ago

EMPERORAYUSH commented 1 month ago

I have set up ipex-llm by following the "Install IPEX-LLM for llama.cpp" guide up to step 2, since my main goal was to run Ollama on my integrated Intel UHD Graphics; step 3 is just an example.

Then, following the "Initialize Ollama" quickstart, I initialized Ollama and set the environment variables:

set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

and then started the Ollama server: ollama serve
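
For reference, the same setup expressed as a small Python script (a sketch only: it assumes the IPEX-LLM build of ollama.exe is on PATH and simply mirrors the set commands above):

# Sketch: launch "ollama serve" with the same environment variables as the
# `set` commands above. Assumes the IPEX-LLM ollama.exe is on PATH.
import os
import subprocess

env = os.environ.copy()
env.update({
    "OLLAMA_NUM_GPU": "999",
    "no_proxy": "localhost,127.0.0.1",
    "ZES_ENABLE_SYSMAN": "1",
    "SYCL_CACHE_PERSISTENT": "1",
    "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
})

# Start the server; log output goes to this console, stop with Ctrl+C.
subprocess.run(["ollama", "serve"], env=env)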

Now, after starting the server, I saw this output in the terminal:

2024/08/02 19:07:04 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\\Users\\ayush\\llama-cpp\\dist\\windows-amd64\\ollama_runners OLLAMA_TMPDIR:]"
time=2024-08-02T19:07:04.511+05:30 level=INFO source=images.go:729 msg="total blobs: 10"
time=2024-08-02T19:07:04.512+05:30 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-02T19:07:04.522+05:30 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-02T19:07:04.524+05:30 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu]"

Now, when I try to run any model, for example tinyllama, I see this output in the Miniforge prompt where Ollama is running:

[GIN] 2024/08/02 - 19:07:07 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/02 - 19:07:07 | 200 |      2.0067ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-02T19:07:07.969+05:30 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=23 memory.available="3.0 GiB" memory.required.full="787.0 MiB" memory.required.partial="787.0 MiB" memory.required.kv="44.0 MiB" memory.weights.total="571.4 MiB" memory.weights.repeating="520.1 MiB" memory.weights.nonrepeating="51.3 MiB" memory.graph.full="148.0 MiB" memory.graph.partial="144.3 MiB"
time=2024-08-02T19:07:07.970+05:30 level=INFO source=server.go:342 msg="starting llama server" cmd="C:\\Users\\ayush\\llama-cpp\\dist\\windows-amd64\\ollama_runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\ayush\\.ollama\\models\\blobs\\sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 51653"
time=2024-08-02T19:07:07.975+05:30 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-08-02T19:07:07.980+05:30 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-08-02T19:07:07.980+05:30 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=1 commit="b791c1a" tid="3128" timestamp=1722605828
INFO [wmain] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="3128" timestamp=1722605828 total_threads=4
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="3" port="51653" tid="3128" timestamp=1722605828
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from C:\Users\ayush\.ollama\models\blobs\sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = TinyLlama
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_0:  155 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name     = TinyLlama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
time=2024-08-02T19:07:08.239+05:30 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0|     [opencl:gpu:0]|                     Intel UHD Graphics|    3.0|     32|     256|   32|  3344M|       27.20.100.8935|
ggml_backend_sycl_set_mul_device_mode: true
llama_model_load: error loading model: DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
llama_load_model_from_file: exception loading model
time=2024-08-02T19:07:08.600+05:30 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-08-02T19:07:08.852+05:30 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/08/02 - 19:07:08 | 500 |    1.5379956s |       127.0.0.1 | POST     "/api/chat"

The above is a continuation of the previous output.

And in the terminal where I tried to run the model with ollama run tinyllama, I see:

Error: llama runner process has terminated: exit status 0xc0000409

Please help me fix this issue!

sgwhat commented 1 month ago

Hi @EMPERORAYUSH, could you please run the environment-check script at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts and reply to us with the output?

EMPERORAYUSH commented 1 month ago
(llm-cpp) C:\AYUSH PANDEY\xpu-smi-1.2.38-20240718.060120.0db09695_win>.\env-check
Python 3.11.9
-----------------------------------------------------------------
transformers=4.43.3
-----------------------------------------------------------------
torch=2.2.0+cpu
-----------------------------------------------------------------
Name: ipex-llm
Version: 2.1.0b20240802
Summary: Large Language Model Develop Toolkit
Home-page: https://github.com/intel-analytics/ipex-llm
Author: BigDL Authors
Author-email: bigdl-user-group@googlegroups.com
License: Apache License, Version 2.0
Location: C:\Users\ayush\miniforge-pypy3\envs\llm-cpp\Lib\site-packages
Requires:
Required-by:
-----------------------------------------------------------------
IPEX is not installed properly.
-----------------------------------------------------------------

-----------------------------------------------------------------
Traceback (most recent call last):
  File "C:\AYUSH PANDEY\xpu-smi-1.2.38-20240718.060120.0db09695_win\check.py", line 179, in <module>
    main()
  File "C:\AYUSH PANDEY\xpu-smi-1.2.38-20240718.060120.0db09695_win\check.py", line 173, in main
    check_cpu()
  File "C:\AYUSH PANDEY\xpu-smi-1.2.38-20240718.060120.0db09695_win\check.py", line 111, in check_cpu
    values = cpu_info[1]
             ~~~~~~~~^^^
IndexError: list index out of range
-----------------------------------------------------------------
System Information

Host Name:                 NZXT-CUSTOM
OS Name:                   Microsoft Windows 11 Home Single Language
OS Version:                10.0.22631 N/A Build 22631
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Workstation
OS Build Type:             Multiprocessor Free
Registered Owner:          HP
Registered Organization:   HP
Product ID:                00327-35901-64212-AAOEM
Original Install Date:     14-03-2024, 12:13:12
System Boot Time:          04-08-2024, 17:02:13
System Manufacturer:       HP
System Model:              HP Laptop 15q-ds3xxx
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 126 Stepping 5 GenuineIntel ~1190 Mhz
BIOS Version:              Insyde F.36, 03-02-2021
Windows Directory:         C:\WINDOWS
System Directory:          C:\WINDOWS\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             hi;Hindi
Input Locale:              00004009
Time Zone:                 (UTC+05:30) Chennai, Kolkata, Mumbai, New Delhi
Total Physical Memory:     7,974 MB
Available Physical Memory: 1,792 MB
Virtual Memory: Max Size:  20,262 MB
Virtual Memory: Available: 13,770 MB
Virtual Memory: In Use:    6,492 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              N/A
Hotfix(s):                 4 Hotfix(s) Installed.
                           [01]: KB5037591
                           [02]: KB5027397
                           [03]: KB5040442
                           [04]: KB5039338
Network Card(s):           2 NIC(s) Installed.
                           [01]: Realtek RTL8821CE 802.11ac PCIe Adapter
                                 Connection Name: Wi-Fi
                                 DHCP Enabled:    Yes
                                 DHCP Server:     192.168.115.69
                                 IP address(es)
                                 [01]: 192.168.115.162
                           [02]: Bluetooth Device (Personal Area Network)
                                 Connection Name: Bluetooth Network Connection
                                 Status:          Media disconnected
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.
-----------------------------------------------------------------
Error: Level Zero Initialization Error
xpu-smi is not installed properly.

I installed oneAPI and ran this command while in the llm-cpp environment, from inside that folder.

sgwhat commented 1 month ago

Your output shows Error: Level Zero Initialization Error, which is why your Ollama cannot run the model. You may refer to this guide to install the prerequisites on your device.
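
After installing the prerequisites, it is worth confirming that the Level Zero / SYCL runtime can actually see your GPU. A quick check (a sketch only, assuming the oneAPI environment has been activated so that sycl-ls is on PATH):

# Sketch: list the SYCL devices visible to the runtime.
# Assumes the oneAPI environment is activated so that `sycl-ls` is on PATH.
import subprocess

result = subprocess.run(["sycl-ls"], capture_output=True, text=True)
print(result.stdout)
# A working setup should list a level_zero:gpu entry for the iGPU; if only
# opencl / cpu devices show up, the Level Zero driver is still missing.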

EMPERORAYUSH commented 1 month ago

@sgwhat So after the installation (and after completing the full guide and running the Qwen 1.8B model), how can I run it on Ollama? Should I wipe my Miniforge environment and start over from scratch?

EMPERORAYUSH commented 1 month ago

@sgwhat I created a new environment, installed all the pip packages, and then ran the Python code. But right after importing ipex_llm.transformers, I saw this warning (although the import itself succeeded): C:\Users\ayush\miniforge-pypy3\envs\llm\Lib\site-packages\intel_extension_for_pytorch\xpu\lazy_init.py:80: UserWarning: XPU Device count is zero! (Triggered internally at C:/Users/arc/ruijie/2.1_RC3/python311/frameworks.ai.pytorch.ipex-gpu/csrc/gpu/runtime/Device.cpp:127.) _C._initExtension()

Also, when I create tensor_1 with tensor_1 = torch.randn(1, 1, 40, 128).to('xpu'), Python just crashes (it drops out of the interactive shell).
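
A minimal availability check before calling .to('xpu') would look something like the sketch below (assuming the XPU builds of torch and intel_extension_for_pytorch that come with ipex-llm[xpu] are installed):

# Sketch: verify that PyTorch can see an XPU device before moving tensors to it.
# Assumes the XPU builds of torch and intel_extension_for_pytorch are installed.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())

if torch.xpu.is_available():
    tensor_1 = torch.randn(1, 1, 40, 128).to("xpu")
    print("Tensor device:", tensor_1.device)
else:
    print("No XPU device visible; fix the GPU driver / oneAPI setup first.")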

sgwhat commented 1 month ago

Have you downloaded and installed the GPU driver from the official Intel download page?

EMPERORAYUSH commented 1 month ago

@sgwhat I have UHD Graphics, so I was afraid that installing the Arc or Iris Xe drivers would brick my motherboard; that's why I didn't install them. Also, the docs describe that step as optional.

sgwhat commented 1 month ago

You do need to install the GPU driver from the official Intel download page; we have verified that it works on UHD Graphics. Please install the latest version of the GPU driver as required; the "optional" note only refers to upgrading the driver.
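
To confirm which driver version is currently installed, something like the following works on Windows (a sketch only, querying the standard Win32_VideoController WMI class via PowerShell):

# Sketch: print the display adapters and their driver versions on Windows.
import subprocess

cmd = [
    "powershell", "-NoProfile", "-Command",
    "Get-CimInstance Win32_VideoController | "
    "Select-Object Name, DriverVersion | Format-Table -AutoSize",
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
# The server log above reports driver 27.20.100.8935 for the UHD Graphics,
# which is well behind the current Intel graphics driver releases.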

vinixwu commented 1 month ago

I also ran into this problem and read the posts above. I tried installing the latest version (32.0.101.5768) of the GPU driver from the URL offered by @sgwhat, but the installer said it could not find any driver that can be installed for this device and exited with code 8 (message translated from Chinese).

My CPU is an i5-10400, and the latest driver on the Intel Download Center that supports this CPU is 31.0.101.2128. Does that mean this CPU cannot be used by IPEX-LLM for Ollama?

sgwhat commented 1 month ago

Hi @vinixwu, we have not tested 10th-generation CPUs, but you may still try installing the GPU driver and running Ollama.