[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
2.55k stars · 207 forks
Replace the FasterTransformer-like KV cache layout and kernel with FlashAttention for better support of longer sequences #239
Open
JerryGJX opened 1 week ago
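For context on the direction this issue proposes, below is a minimal sketch (not the repo's actual code) of decoding with flash-attn's fused `flash_attn_with_kvcache` kernel, which works on a plain `(batch, seqlen, heads, head_dim)` KV cache rather than a FasterTransformer-style blocked layout with a custom attention kernel. It assumes flash-attn >= 2.2 is installed; the shapes and sequence lengths are illustrative only.

```python
# Sketch: single-token decode step using FlashAttention's KV-cache kernel.
# Assumes flash-attn >= 2.2; shapes/dtypes here are illustrative, not AWQ's.
import torch
from flash_attn import flash_attn_with_kvcache

batch, max_seqlen, n_heads, head_dim = 2, 4096, 32, 128
device, dtype = "cuda", torch.float16

# Pre-allocated KV cache in FlashAttention's native (B, S, H, D) layout.
k_cache = torch.zeros(batch, max_seqlen, n_heads, head_dim, device=device, dtype=dtype)
v_cache = torch.zeros(batch, max_seqlen, n_heads, head_dim, device=device, dtype=dtype)
cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device=device)  # tokens already cached

# New query/key/value for one generated token per sequence.
q = torch.randn(batch, 1, n_heads, head_dim, device=device, dtype=dtype)
k_new = torch.randn(batch, 1, n_heads, head_dim, device=device, dtype=dtype)
v_new = torch.randn(batch, 1, n_heads, head_dim, device=device, dtype=dtype)

# The kernel appends k_new/v_new into the cache at cache_seqlens and
# computes attention over the full cached context in one fused call.
out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)
print(out.shape)  # (batch, 1, n_heads, head_dim)
```

Because the kernel handles cache append and attention in one call over an arbitrary cached length, it scales to longer sequences without the fixed blocking assumptions of the FasterTransformer-style layout.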