intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Improve First Token Latency for multi-GPU projects (by flash attention or alternative) #10897

Open moutainriver opened 5 months ago

moutainriver commented 5 months ago

For the multi-GPU solution, we still face challenges with first-token latency. The breakdown data has been shared offline. Please help add more optimization features (e.g., SDP/Flash Attention) to improve first-token latency.
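
For context, a minimal generic sketch of the kind of optimization being requested (not ipex-llm's implementation): PyTorch's fused scaled-dot-product attention avoids materializing the full (seq × seq) attention matrix during prefill, which is where first-token latency is spent on long prompts. The shapes, dtype, and CPU execution below are assumptions for illustration only.

```python
# Sketch only: compares a naive attention (materializes the full score matrix)
# with PyTorch's fused SDPA kernel, which is the style of optimization
# requested for prefill / first-token latency. Shapes are hypothetical.
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Builds the full (seq, seq) causal score matrix -- memory-bound for long prompts.
    seq = q.shape[-2]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # Single fused kernel; avoids materializing the attention matrix.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

if __name__ == "__main__":
    batch, heads, seq, head_dim = 1, 32, 2048, 128  # hypothetical prefill shape
    q = torch.randn(batch, heads, seq, head_dim)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out_naive = naive_attention(q, k, v)
    out_fused = fused_attention(q, k, v)
    print(out_naive.shape, out_fused.shape)
```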

yangqing-yq commented 3 weeks ago

Marking this issue.