raj-ritu17 closed this issue 6 months ago
Could you please try adding PARAMETER use_mmap false to the Modelfile as below and create the model again?
FROM /home/rajritu/ritu/ipex-llm/models/7b/llama-2-7b.Q4_K_M.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_gpu 999
PARAMETER num_predict 64
PARAMETER use_mmap false
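For completeness, a minimal sketch of recreating and running the model with this Modelfile, assuming it is saved as ./Modelfile next to the ollama binary and the model is named llama2 (both of those names are assumptions):

```bash
# Rebuild the model from the updated Modelfile (file path and model name are assumptions)
cd ~/ritu/ipex-llm/ollama        # directory holding the ollama binary, per the session below
./ollama create llama2 -f ./Modelfile
./ollama run llama2
```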
@sgwhat .. perfect, that works.
But now there is another problem: if I run without a Modelfile, I am facing a different issue.
level=ERROR source=server.go:281 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: 1 "
Below is an example:
(llm-cpp-V2) rajritu@IMU-NEX-EMR1-SUT:~/ritu/ipex-llm/ollama$ ./ollama run llama2
Error: llama runner process no longer running: 1
server error:
found 4 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Data Center GPU Flex 170| 1.3| 512| 1024| 32| 14193102848|
| 1| [opencl:gpu:0]| Intel(R) Data Center GPU Flex 170| 3.0| 512| 1024| 32| 14193102848|
| 2| [opencl:cpu:0]| INTEL(R) XEON(R) PLATINUM 8580| 3.0| 240| 8192| 64| 67113836544|
| 3| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 240|67108864| 64| 67113836544|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
Specified ngl 0 is smaller than model n_layer 32; please set -ngl equal to or larger than 32
time=2024-04-19T09:52:21.196+02:00 level=ERROR source=server.go:281 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: 1 "
time=2024-04-19T09:52:21.196+02:00 level=ERROR source=server.go:289 msg="unable to load any llama server" error="llama runner process no longer running: 1 "
[GIN] 2024/04/19 - 09:52:21 | 500 | 1.936408145s | 127.0.0.1 | POST "/api/chat"
Hi @raj-ritu17, you can use pip install --pre --upgrade ipex-llm[cpp] to install the latest version of ollama. We now only support setting export OLLAMA_NUM_GPU=999 to run the model on the GPU, instead of setting this in the Modelfile.
Please see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#run-ollama-serve for more details.
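A rough sketch of the suggested flow, following the linked quickstart; the init-ollama helper and the oneAPI setvars.sh path are assumptions taken from that guide and may differ on your machine:

```bash
# Upgrade ipex-llm with the C++ / ollama backend (command from the reply above)
pip install --pre --upgrade ipex-llm[cpp]

# Re-create the ollama binary in the working directory (helper name assumed from the quickstart)
init-ollama

# oneAPI environment; ZES_ENABLE_SYSMAN is suggested by the warning in the log above
source /opt/intel/oneapi/setvars.sh   # path assumed
export ZES_ENABLE_SYSMAN=1

# Offload all layers to the GPU via the environment instead of the Modelfile
export OLLAMA_NUM_GPU=999
./ollama serve

# Then, in a second terminal:
./ollama run llama2
```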
server-side:
Exception caught at file:/home/arda/actions-runner/bigdl-core-cpp-build/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17037, func:operator()
client-side:
I have tried to run ollama on Flex 170 but failed to run the model. I followed this document: doc
Intel Flex and oneAPI are installed on the machine; below is the GPU information:
Modelfile content:
ollama-list:
Also, I have tried via an API call:
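Purely as an illustration (not the exact command from the original report), a request against the /api/chat endpoint seen in the server log might look like the sketch below, assuming ollama's default port 11434:

```bash
# Illustrative only -- endpoint /api/chat appears in the log above; port 11434 is ollama's default
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}'
```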