intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Support for max_loaded_models and num_parallel variables/parameters #11225

Open jars101 opened 3 months ago

jars101 commented 3 months ago

It does not seem that Ollama running on ipex-llm supports the recent max_loaded_models and num_parallel variables/parameters. Are they supported in the current Ollama version under llama.cpp? How does one enable them? Thanks.

sgwhat commented 3 months ago

Hi @jars101 ,

  1. Ollama does not support max_loaded_models.
  2. You may run the commands below to enable the num_parallel setting (a Windows equivalent is sketched after this list):
    export OLLAMA_NUM_PARALLEL=2
    ./ollama serve
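
For reference, a rough Windows (cmd) equivalent, assuming the ipex-llm Ollama binary is on the PATH of your llm-cpp environment:

    rem Set the parallelism for this shell session, then start the server
    set OLLAMA_NUM_PARALLEL=2
    ollama serve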
jars101 commented 3 months ago

Thank you @sgwhat. Recent versions of Ollama do support both OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL on Linux and Windows. When running Ollama through ipex-llm, however, the settings do not seem to stick: every request against the same model reloads the model into memory. This behaviour does not occur with standalone Ollama for Windows.

llm-cpp log snippet:

(base) C:\Users\Admin>conda activate llm-cpp

(llm-cpp) C:\Users\Admin>ollama serve
2024/06/05 04:57:07 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0:] OLLAMA_RUNNERS_DIR:C:\Users\Admin\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-05T04:57:07.321-07:00 level=INFO source=images.go:704 msg="total blobs: 78"
time=2024-06-05T04:57:07.351-07:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-06-05T04:57:07.361-07:00 level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"

jars101 commented 3 months ago

My bad, OLLAMA_NUM_PARALLEL does work but OLLAMA_MAX_LOADED_MODELS does not. I went ahead and deployed a new installation of llm-cpp + Ollama, and I can now make use of both variables; ollama ps is available as well. The only problem I see, when setting OLLAMA_KEEP_ALIVE to 600 seconds for instance, is the following error:

INFO [print_timings] total time = 38167.02 ms | slot_id=0 t_prompt_processing=2029.651 t_token_generation=36137.372 t_total=38167.023 task_id=3 tid="3404" timestamp=1717656974
[GIN] 2024/06/05 - 23:56:14 | 200 | 47.9829961s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:16685, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:16685
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[GIN] 2024/06/05 - 23:58:18 | 200 | 4.1697536s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17384, func:operator()
SYCL error: CHECK_TRY_ERROR(g_syclStreams[sycl_ctx->device][0]->memcpy( (char *)tensor->data + offset, data, size).wait()): Meet error in this line code!
  in function ggml_backend_sycl_set_tensor_async at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17384
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

After removing OLLAMA_KEEP_ALIVE and letting it fall back to the default of 5 minutes, I am observing the same issue.
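
For reference, OLLAMA_KEEP_ALIVE accepts either a duration string or a number of seconds, and keep_alive can also be set per request through the standard Ollama REST API; a minimal sketch (the model name llama3 is only an example):

    rem Keep models loaded for 10 minutes after each request (cmd syntax)
    set OLLAMA_KEEP_ALIVE=10m
    ollama serve

    rem Per-request override via the REST API (inner quotes escaped for cmd)
    curl http://localhost:11434/api/chat -d "{\"model\": \"llama3\", \"keep_alive\": \"10m\", \"messages\": [{\"role\": \"user\", \"content\": \"hello\"}]}"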

sgwhat commented 3 months ago

Could you please share the output of pip list from your environment and your GPU model? It would also help us resolve the issue if you could provide more information from the Ollama server side.
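
For example, something along these lines should collect that information, assuming the llm-cpp conda environment and an initialized oneAPI setup (so that sycl-ls is available):

    rem Package versions from the environment that runs Ollama
    conda activate llm-cpp
    pip list

    rem GPU and driver as seen by the SYCL runtime
    sycl-ls

    rem Restart the server with verbose logging to capture more detail around the crash
    set OLLAMA_DEBUG=1
    ollama serve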