Description
The new batch kernel supports more devices. It requires `state_size % 128 == 0` instead of `state_size % 256 == 0` and drops the `output_size % 32 == 0` requirement, but its maximum supported batch size is 48 instead of 64.
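For reference, here is a minimal sketch of the eligibility check implied by the new constraints; the function and parameter names below are illustrative only and are not the actual ipex-llm dispatch code.

```python
# Illustrative only: names are hypothetical, not the real ipex-llm API.

def can_use_new_batch_kernel(state_size: int, batch_size: int) -> bool:
    """New batch kernel: state_size must be a multiple of 128 (was 256),
    the output_size % 32 == 0 requirement is dropped, and the maximum
    batch size is 48 (was 64)."""
    return state_size % 128 == 0 and batch_size <= 48


def can_use_old_batch_kernel(state_size: int, output_size: int, batch_size: int) -> bool:
    """Old batch kernel constraints, kept here only for comparison."""
    return state_size % 256 == 0 and output_size % 32 == 0 and batch_size <= 64
```

For example, `state_size = 3968` (a multiple of 128 but not of 256) is now accepted, while a batch size of 64 no longer is.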
1. Why the change?
2. User API changes
3. Summary of the change
4. How to test?
[ ] N/A
[ ] Unit test: Please manually trigger the PR Validation here by entering the PR number (e.g., 1234), and paste your action link here once it has finished successfully.