intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Exception while running ollama, caught ggml-sycl.cpp, line:17037, func:operator() #10797

Closed: raj-ritu17 closed this issue 6 months ago

raj-ritu17 commented 6 months ago

server-side: Exception caught at file:/home/arda/actions-runner/bigdl-core-cpp-build/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17037, func:operator()

client-side:

(llm-cpp-V2) rajritu@IMU-NEX-EMR1-SUT:~/ritu/ipex-llm/ollama$ ./ollama run mario
Error: llama runner process no longer running: -1 error:CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, data, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at /home/arda/actions-runner/bigdl-core-cpp-build/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17037
GGML_ASSERT: /home/arda/actions-runner/bigdl-core-cpp-build/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

[ollama-log.txt](https://github.com/intel-analytics/ipex-llm/files/15023332/ollama-log.txt)

I tried to run ollama on a Flex 170 but failed to run the model. I followed this document: doc

Intel Flex drivers and oneAPI are installed on the machine; below is the GPU information:

+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Flex 170                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0000-dc8d-d308092ec026                                       |
|           | PCI BDF Address: 0000:29:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
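
A side note on reproducing this device summary: it looks like the output of Intel's xpu-smi tool, though the exact command below is an assumption on my part and not part of the original report.

# Hypothetical command used to obtain the GPU information above (assumes xpu-smi is installed)
xpu-smi discovery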

Modelfile content:

FROM /home/rajritu/ritu/ipex-llm/models/7b/llama-2-7b.Q4_K_M.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_gpu 999
PARAMETER num_predict 64

ollama list output:

NAME                    ID              SIZE    MODIFIED
dolphin-phi:latest      c5761fc77240    1.6 GB  19 hours ago
example:latest          28068906de65    4.4 GB  22 hours ago
llama2:latest           78e26419b446    3.8 GB  19 hours ago
llama7bGGUF:latest      99dc0251bd3a    4.1 GB  44 minutes ago
mario:latest            04a916eb4c79    3.8 GB  3 hours ago

I have also tried via an API call:

curl http://localhost:11434/api/generate -d '
{ 
   "model": "dolphin-phi", 
   "prompt": "Why is the sky blue?", 
   "stream": false,
   "options":{"num_gpu": 999}
}'
sgwhat commented 6 months ago

Could you please try adding PARAMETER use_mmap false to the Modelfile as below and create the model again?

FROM /home/rajritu/ritu/ipex-llm/models/7b/llama-2-7b.Q4_K_M.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_gpu 999
PARAMETER num_predict 64
PARAMETER use_mmap false
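
After editing the Modelfile, the model needs to be recreated before it is run again; a minimal sketch, assuming the file is saved as ./Modelfile and the model keeps the name mario from the listing above:

# Recreate the model from the updated Modelfile, then run it again
./ollama create mario -f ./Modelfile
./ollama run mario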
raj-ritu17 commented 6 months ago

@sgwhat, perfect, that works. But now there is another problem: if I run without a Modelfile, I face a different issue. level=ERROR source=server.go:281 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: 1 "

Below is an example:

(llm-cpp-V2) rajritu@IMU-NEX-EMR1-SUT:~/ritu/ipex-llm/ollama$ ./ollama run llama2
Error: llama runner process no longer running: 1

server error:

found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|            Intel(R) Data Center GPU Flex 170|       1.3|        512|    1024|     32|    14193102848|
| 1|    [opencl:gpu:0]|            Intel(R) Data Center GPU Flex 170|       3.0|        512|    1024|     32|    14193102848|
| 2|    [opencl:cpu:0]|               INTEL(R) XEON(R) PLATINUM 8580|       3.0|        240|    8192|     64|    67113836544|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|        240|67108864|     64|    67113836544|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
Specified ngl 0 is smaller than model n_layer 32; please set -ngl equal to or larger than 32
time=2024-04-19T09:52:21.196+02:00 level=ERROR source=server.go:281 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: 1 "
time=2024-04-19T09:52:21.196+02:00 level=ERROR source=server.go:289 msg="unable to load any llama server" error="llama runner process no longer running: 1 "
[GIN] 2024/04/19 - 09:52:21 | 500 |  1.936408145s |       127.0.0.1 | POST     "/api/chat"
sgwhat commented 6 months ago

Hi @raj-ritu17, you can use pip install --pre --upgrade ipex-llm[cpp] to install the latest version of ollama. We only support setting export OLLAMA_NUM_GPU=999 to run the model on the GPU, instead of setting this in the Modelfile.
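
A minimal sketch of that flow, based only on the commands mentioned in this thread (the ZES_ENABLE_SYSMAN export comes from the warning in the server log above; any other environment setup from the quickstart is omitted here):

# Upgrade to the latest ipex-llm ollama backend
pip install --pre --upgrade ipex-llm[cpp]

# Offload all layers to the Intel GPU via environment variable (instead of the Modelfile)
export OLLAMA_NUM_GPU=999
# Optional: lets the SYCL backend report free GPU memory (see the earlier log warning)
export ZES_ENABLE_SYSMAN=1

# Start the server, then run the model from another shell
./ollama serve
./ollama run llama2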

sgwhat commented 6 months ago

Please see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#run-ollama-serve for more details.