mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

Replace the FasterTransformer-style KV cache layout and kernel with FlashAttention for better support of longer sequences #239

Open JerryGJX opened 1 week ago

JerryGJX commented 1 week ago