Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
Improve First Token Latency for multi-GPU projects (by flash attention or alternative) #10897
For the multi-GPU solution, we still face challenges with First Token Latency; the breakdown data has been shared offline.
Please add more optimization features (such as SDP/Flash Attention) to improve First Token Latency.
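As context for the request: SDP/flash-attention-style kernels mainly help prefill (first token) because that phase computes attention over the whole prompt at once. The sketch below is not ipex-llm code, just a minimal PyTorch illustration contrasting a naive attention implementation, which materializes the full attention matrix, with the fused `scaled_dot_product_attention` kernel that flash-attention backends plug into; all shapes and names here are illustrative.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (seq_len, seq_len) score matrix; this
    # memory traffic is what dominates prefill / first-token latency
    # for long prompts.
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

out_naive = naive_attention(q, k, v)
# Fused SDP entry point: same math, but eligible backends (e.g.
# flash-attention-style kernels) avoid materializing the full
# attention matrix, cutting memory traffic during prefill.
out_sdp = F.scaled_dot_product_attention(q, k, v)

# The two paths agree numerically; only the kernel strategy differs.
assert torch.allclose(out_naive, out_sdp, atol=1e-4)
```

For multi-GPU tensor-parallel runs, the same substitution would apply per rank to each shard of the attention heads; the actual kernel dispatched depends on the backend available on the device.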