Description
Update the vLLM integration to use vLLM 0.6.2.
We need to change the following:
ipex-llm/python/llm/example/GPU/vLLM-Serving
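For context, below is a minimal offline-inference sketch against the upstream vLLM 0.6.2 Python API. The model name and sampling settings are placeholders, and the XPU/ipex-llm specific engine arguments used by the vLLM-Serving example are omitted; this is only an illustration of the API surface the example targets, not the example itself.

```python
# Minimal offline-inference sketch with the upstream vLLM 0.6.2 API.
# Model name and sampling settings are placeholders, not values from this PR.
from vllm import LLM, SamplingParams

prompts = ["What is IPEX-LLM?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# The XPU-specific engine arguments used by the vLLM-Serving example are
# omitted here; refer to the example under vLLM-Serving for those.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model name

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```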
1. Why the change?
2. User API changes
3. Summary of the change
4. How to test?
[x] Unit test: Please manually trigger the PR Validation here by inputting the PR number (e.g., 1234), and paste your action link here once it has finished successfully.
[ ] Application test (see the sketch below)
[ ] Document test
[ ] ...
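For the application test, one possible smoke test is to send a completion request to the running service and check the response. The host, port, and model name below are assumptions for illustration, not values from this PR; adjust them to however the updated vLLM-Serving example is launched (it exposes an OpenAI-compatible endpoint).

```python
# Hypothetical smoke test against an OpenAI-compatible completions endpoint.
# Host, port, and model name are assumptions; change them to match the
# actual launch configuration of the updated vLLM-Serving example.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "What is IPEX-LLM?",
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```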
5. Known issues
[x] Sometimes this fails on initial start-up with a timeout error...