intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Flex 170 GPU: ollama unable to detect GPU, and sycl-ls also not detecting it #10801

Open shailesh837 opened 4 months ago

shailesh837 commented 4 months ago

I am testing the old ollama binary with the llama2:7b model on an Intel Flex 170 GPU. I have followed the new ipex-llm documentation for the driver install and the remaining steps, but when I run ollama serve it does not detect the GPU, and I also see a library-loading error while ollama serve is running. Attaching the ollama serve log to this issue: ollama_serve_flex_170.log

```
spandey2@IMU-NEX-EMR1-SUT:~/LLM_SceneScape_ChatBot$ sudo xpu-smi discovery
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Flex 170                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0000-dc8d-d308092ec026                                       |
|           | PCI BDF Address: 0000:29:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
```

```
spandey2@IMU-NEX-EMR1-SUT:~/LLM_SceneScape_ChatBot$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, INTEL(R) XEON(R) PLATINUM 8580 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
```

Modelfile:

```
FROM llama2:latest
PARAMETER num_gpu 999
PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER use_mmap false
```

Ollama serve log:

```
time=2024-04-18T23:44:00.314+02:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-18T23:44:00.315+02:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27642.40]"
time=2024-04-18T23:44:00.345+02:00 level=INFO source=gpu.go:377 msg="Unable to load oneAPI management library /usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27642.40: oneapi vram init failure: 2013265921"
time=2024-04-18T23:44:00.345+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-18T23:44:00.345+02:00 level=INFO source=routes.go:1044 msg="no GPU detected"
[GIN] 2024/04/18 - 23:45:10 | 200 |    120.449µs |   127.0.0.1 | HEAD     "/"
[GIN] 2024/04/18 - 23:45:10 | 200 |   1.357185ms |   127.0.0.1 | GET      "/api/tags"
time=2024-04-18T23:46:49.533+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-18T23:46:49.533+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-18T23:46:49.533+02:00 level=INFO source=llm.go:77 msg="GPU not available, falling back to CPU"
loading library /tmp/ollama1871007333/cpu_avx2/libext_server.so
```

```
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU input buffer size = 16.02 MiB
llama_new_context_with_model: CPU compute buffer size = 308.00 MiB
```
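The failure is visible in the serve log above: the oneAPI management library fails to initialize ("oneapi vram init failure"), Ollama reports "no GPU detected", and all 33 layers stay on the CPU. A minimal, hypothetical helper (not part of Ollama or ipex-llm) for scanning a serve log for this CPU fallback might look like:

```python
def gpu_detected(log_text: str) -> bool:
    """Heuristically scan an `ollama serve` log for the CPU-fallback
    messages shown above; return False if any of them appear."""
    fallback_markers = (
        "no GPU detected",
        "GPU not available, falling back to CPU",
        "oneapi vram init failure",
    )
    return not any(marker in log_text for marker in fallback_markers)

print(gpu_detected('msg="no GPU detected"'))  # → False
```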

```
sudo xpu-smi stats -d 0
IMU-NEX-EMR1-SUT: Thu Apr 18 23:49:41 2024
+-----------------------------+--------------------------------------------------------------------+
| Device ID                   | 0                                                                  |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%)         | 0                                                                  |
| EU Array Active (%)         | N/A                                                                |
| EU Array Stall (%)          | N/A                                                                |
| EU Array Idle (%)           | N/A                                                                |
| Compute Engine Util (%)     | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0              |
| Render Engine Util (%)      | 0; Engine 0: 0                                                     |
| Media Engine Util (%)       | 0                                                                  |
| Decoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Encoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Copy Engine Util (%)        | 0; Engine 0: 0                                                     |
| Media EM Engine Util (%)    | Engine 0: 0, Engine 1: 0                                           |
| 3D Engine Util (%)          | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| Reset                       | N/A                                                                |
| Programming Errors          | N/A                                                                |
| Driver Errors               | N/A                                                                |
| Cache Errors Correctable    | N/A                                                                |
| Cache Errors Uncorrectable  | N/A                                                                |
| Mem Errors Correctable      | N/A                                                                |
| Mem Errors Uncorrectable    | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | 43                                                                 |
| GPU Frequency (MHz)         | 2050                                                               |
| Media Engine Freq (MHz)     | 1025                                                               |
| GPU Core Temperature (C)    | 87                                                                 |
| GPU Memory Temperature (C)  | N/A                                                                |
| GPU Memory Read (kB/s)      | N/A                                                                |
| GPU Memory Write (kB/s)     | N/A                                                                |
| GPU Memory Bandwidth (%)    | 0                                                                  |
| GPU Memory Used (MiB)       | 31                                                                 |
| GPU Memory Util (%)         | 0                                                                  |
| Xe Link Throughput (kB/s)   | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
```

sgwhat commented 4 months ago

Hi Shailesh, I didn't see any GPU device in the sycl-ls output in your log. Could you please check your oneAPI installation, and remember to `source /opt/intel/oneapi/setvars.sh`?

```
spandey2@IMU-NEX-EMR1-SUT:~/LLM_SceneScape_ChatBot$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, INTEL(R) XEON(R) PLATINUM 8580 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
```
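The suggested check can be sketched as a short shell sequence: reload the oneAPI environment, then re-run sycl-ls and look for a Level Zero GPU entry. This assumes the default oneAPI install prefix `/opt/intel/oneapi`; adjust the path if oneAPI is installed elsewhere.

```shell
# Reload the oneAPI environment if it is present at the default prefix.
[ -f /opt/intel/oneapi/setvars.sh ] && . /opt/intel/oneapi/setvars.sh

# Probe for sycl-ls and re-check device visibility.
SYCL_LS="$(command -v sycl-ls || true)"
if [ -n "$SYCL_LS" ]; then
  "$SYCL_LS"   # a healthy GPU stack lists an [ext_oneapi_level_zero:gpu:*] entry
else
  echo "sycl-ls not found in PATH (check the oneAPI installation)"
fi
```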
shailesh837 commented 4 months ago

The issue was that libllama_bigdl_core.so was missing from /usr/lib. But there are two important issues we are still seeing:

a) Why do we need to create a Modelfile with the parameters below?

```
FROM llama2:latest
PARAMETER num_gpu 999
PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER use_mmap false
```

The older version of ollama worked without a Modelfile; we didn't need to force GPU offload, and it still ran all layers on the GPU.
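Since the root cause was a missing shared library, a quick way to confirm whether the dynamic linker can resolve a given .so is a short Python check (a generic sketch, not an ipex-llm tool):

```python
import ctypes

def can_load(libname: str) -> bool:
    """Return True if the dynamic linker can locate and load `libname`."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

# The library this issue was about; prints False while it is missing
# from the linker's search path (e.g. /usr/lib).
print(can_load("libllama_bigdl_core.so"))
```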

b) Why does loading the model with ollama serve take 50 seconds, and a response 20 seconds? The same model with the same ollama serve on an Arc A770 GPU (16 GB of GPU memory, the same as the Flex 170) loads in 10 seconds or less, and a response takes 1-2 seconds.

sgwhat commented 4 months ago

Hi @shailesh837,

  1. You may switch to our latest release version of Ollama with `pip install --pre --upgrade ipex-llm[cpp]`. In this version, libllama_bigdl_core.so is no longer required.
  2. In the latest version of Ollama, it is not necessary to set `PARAMETER num_gpu 999` in the Modelfile. The usage otherwise remains the same as in previous versions.
  3. Please ensure that Ollama is running on a GPU device. We will investigate the causes of the reduced performance on the Flex 170.
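For point 3, the environment used to steer ipex-llm's Ollama onto the GPU around this time looked roughly like the following. This is a sketch: the variable names are taken from the ipex-llm Ollama quickstart of that era and may differ between releases, so verify against the current documentation.

```shell
# Offload all model layers to the GPU (quickstart-era equivalent of
# `PARAMETER num_gpu 999` in the Modelfile).
export OLLAMA_NUM_GPU=999
# Let Level Zero report device memory, so Ollama can size the offload.
export ZES_ENABLE_SYSMAN=1
export no_proxy=localhost,127.0.0.1
# oneAPI runtime (default install prefix assumed).
[ -f /opt/intel/oneapi/setvars.sh ] && . /opt/intel/oneapi/setvars.sh
# Then start the server: ./ollama serve
echo "OLLAMA_NUM_GPU=$OLLAMA_NUM_GPU"
```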