[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
2.55k stars · 207 forks
Replace the FasterTransformer-like KV cache layout and kernel with FlashAttention for better support of longer sequences #239
Open
JerryGJX opened 1 week ago
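For context on the direction this issue proposes, below is a minimal sketch (not the repo's actual code) of decoding with flash-attn's fused `flash_attn_with_kvcache` kernel, which works on a plain `(batch, seqlen, heads, head_dim)` KV cache rather than a FasterTransformer-style blocked layout with a custom attention kernel. It assumes flash-attn >= 2.2 is installed; the shapes and sequence lengths are illustrative only.

```python
# Sketch: single-token decode step using FlashAttention's KV-cache kernel.
# Assumes flash-attn >= 2.2; shapes/dtypes here are illustrative, not AWQ's.
import torch
from flash_attn import flash_attn_with_kvcache

batch, max_seqlen, n_heads, head_dim = 2, 4096, 32, 128
device, dtype = "cuda", torch.float16

# Pre-allocated KV cache in FlashAttention's native (B, S, H, D) layout.
k_cache = torch.zeros(batch, max_seqlen, n_heads, head_dim, device=device, dtype=dtype)
v_cache = torch.zeros(batch, max_seqlen, n_heads, head_dim, device=device, dtype=dtype)
cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device=device)  # tokens already cached

# New query/key/value for one generated token per sequence.
q = torch.randn(batch, 1, n_heads, head_dim, device=device, dtype=dtype)
k_new = torch.randn(batch, 1, n_heads, head_dim, device=device, dtype=dtype)
v_new = torch.randn(batch, 1, n_heads, head_dim, device=device, dtype=dtype)

# The kernel appends k_new/v_new into the cache at cache_seqlens and
# computes attention over the full cached context in one fused call.
out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)
print(out.shape)  # (batch, 1, n_heads, head_dim)
```

Because the kernel handles cache append and attention in one call over an arbitrary cached length, it scales to longer sequences without the fixed blocking assumptions of the FasterTransformer-style layout.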