ri938 opened this issue 1 year ago
Hi, thanks for the pointer. Maybe @kentang-mit and @ys-2020 can take a look at the issue?
Hi @ri938,
Can you please try to benchmark AWQ with HuggingFace's text-generation-inference (TGI) instead of the native HuggingFace model.generate method? I have a fork that supports loading AWQ models in TGI.
For example, you can load AWQ models as follows (after building from source):
text-generation-launcher \
    --huggingface-hub-cache ~/.cache/huggingface/hub/ \
    --model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
    --trust-remote-code \
    --port 8080 \
    --max-input-length 2048 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4096 \
    --quantize awq
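Once the launcher is up, a quick smoke test against it might look like the following (assuming it comes up on port 8080 as configured above; this just uses TGI's standard /generate endpoint, and the prompt and max_new_tokens are placeholders for whatever benchmark you run):

curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 64}}'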
Please also install Flash Attention v1 or v2 and vLLM so that you benefit from PagedAttention and FlashAttention.
Hi,
I am trying to integrate AWQ into the vLLM library. The current issue is that AWQ has worse throughput than the unquantized variant; I think it should at least match it.
Issues found when profiling the AWQ GEMM kernels: i) hitting the shared-memory limit, which results in low GPU occupancy; ii) lots of bank conflicts when storing to shared memory.
I think resolving these could speed up inference a lot.
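To make the bank-conflict point concrete, here is a minimal, generic CUDA sketch (not taken from the AWQ kernel; the transpose is just a stand-in for any column-wise store pattern) showing the usual fix of padding the shared-memory tile by one element per row:

#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32

__global__ void transpose_padded(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    // Padded row (+1): without it, the column-wise store below would land all
    // 32 threads of a warp in the same shared-memory bank and serialize.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    // Coalesced global load, column-wise store into shared memory
    // (the conflict-prone direction that the padding fixes).
    if (x < n && y < n) tile[threadIdx.x][threadIdx.y] = in[y * n + x];
    __syncthreads();

    // Row-wise read from shared memory (conflict-free), coalesced global store
    // of the transposed tile.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.y][threadIdx.x];
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMalloc(&in, n * n * sizeof(float));
    cudaMalloc(&out, n * n * sizeof(float));
    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    transpose_padded<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Whether padding (or a swizzled layout) applies cleanly to the AWQ kernel depends on its tile layout; Nsight Compute (ncu) should confirm whether the shared-store conflict counters drop after such a change.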
Are any of the maintainers willing to discuss this further? You can add me on Discord: "robert1". The main issue at the moment is understanding some parts of the kernel.