intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Running llama3.1 in ollama/langchain fails. #12111

Open · tkarna opened this issue 2 months ago

tkarna commented 2 months ago

After updating ipex-llm, running llama3.1 through langchain and ollama no longer works. A simple reproducer:

# pip install langchain langchain_community
from langchain_community.llms import Ollama

# ollama pull llama3.1:70b-instruct-q4_K_M
llm = Ollama(model="llama3.1:70b-instruct-q4_K_M")
response = llm.invoke("What is the capital of France?")
print(response)

The last known working ipex-llm version is 2.2.0b20240826. Tested on Ubuntu 22.04, oneAPI 2024.2 (intel-basekit 2024.2.1-98) with two Intel(R) Data Center GPU Max 1100 GPUs.

Error message:

[1727090840] warming up the model with an empty run
ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:428: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, bool, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
time=2024-09-23T11:27:23.172Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
time=2024-09-23T11:27:23.423Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
leonardozcm commented 2 months ago

Hi, I think we have fixed this in the latest PR. Could you try ipex-llm[cpp] >= 2.2.0b20240924 tomorrow?
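
For the cpp backend that is normally just a pre-release upgrade, something along these lines (the exact command may differ depending on how your environment was set up):

pip install --pre --upgrade ipex-llm[cpp]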

tkarna commented 2 months ago

Thanks, I can confirm that the simple example works now. However, when running a larger LangChain agents workflow, I'm still getting an error:

/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml/src/ggml-backend.c:96: GGML_ASSERT(base != NULL && "backend buffer base cannot be NULL") failed

I'll see if I can make a small reproducer.
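
In the meantime, here is a rough guess at a reduced version of it, assuming the crash depends on prompt/context length rather than on the agent logic itself (not confirmed):

from langchain_community.llms import Ollama

llm = Ollama(model="llama3.1:70b-instruct-q4_K_M")

# Pad the prompt to force a much longer context than the one-line question above.
long_context = "Paris is the capital of France. " * 500
response = llm.invoke(long_context + "\nSummarise the text above in one sentence.")
print(response)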

tklengyel commented 2 months ago

I still have this issue using Ollama and Open WebUI with llama3.1 as of 2.2.0b20240927.

ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:429: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, bool, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
time=2024-09-27T18:26:03.643-04:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
time=2024-09-27T18:26:03.893-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
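
One generic way to double-check which ipex-llm build is actually active in the Python environment that launches ollama_llama_server (a general sanity check rather than anything specific to this crash):

# Print the installed ipex-llm build (assumes the distribution name is "ipex-llm").
import importlib.metadata
print(importlib.metadata.version("ipex-llm"))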