mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

Helping Speed up Inference #80

Open ri938 opened 1 year ago

ri938 commented 1 year ago

Hi,

I am trying to integrate AWQ into the vLLM library. The current issue is that AWQ has worse throughput than the unquantised variant; I think it should at least match it.

Issues I found when profiling the AWQ GEMM kernels: i) hitting the shared memory limit leads to low GPU occupancy, and ii) lots of bank conflicts when storing to shared memory.

I think that resolving these could speed up inference a lot.
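To make the occupancy point concrete, here is a rough back-of-envelope sketch; the per-block shared memory footprint, block size, and per-SM limits below are placeholder assumptions, not numbers taken from the actual AWQ kernel:

```python
# Back-of-envelope occupancy estimate. All numbers are illustrative
# assumptions, not measurements of the AWQ GEMM kernel.

SMEM_PER_SM = 164 * 1024        # e.g. ~164 KB of shared memory per SM on an A100-class GPU
SMEM_PER_BLOCK = 96 * 1024      # hypothetical shared-memory footprint of one thread block
THREADS_PER_BLOCK = 256         # hypothetical block size
MAX_THREADS_PER_SM = 2048       # typical hardware limit on resident threads per SM

# Resident blocks per SM allowed by the shared-memory budget vs. by the thread limit.
blocks_by_smem = SMEM_PER_SM // SMEM_PER_BLOCK                # -> 1
blocks_by_threads = MAX_THREADS_PER_SM // THREADS_PER_BLOCK   # -> 8

resident_blocks = min(blocks_by_smem, blocks_by_threads)
occupancy = resident_blocks * THREADS_PER_BLOCK / MAX_THREADS_PER_SM

print(f"occupancy: {occupancy:.1%}")  # -> 12.5%: shared memory, not the thread limit, is the cap
```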

Are any of the maintainers willing to discuss this further? You can add me on Discord: "robert1". The main issue at the moment is understanding some parts of the kernel.

tonylins commented 1 year ago

Hi, thanks for the pointer. Maybe @kentang-mit and @ys-2020 can take a look at the issue?

abhinavkulkarni commented 1 year ago

Hi @ri938,

Can you please try to benchmark AWQ with HuggingFace's text-generation-inference (TGI) instead of the native HuggingFace model.generate method? I have a fork that supports loading AWQ models in TGI.

For example, you can load AWQ models as follows (after building from source):

text-generation-launcher --huggingface-hub-cache ~/.cache/huggingface/hub/ --model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq --trust-remote-code --port 8080 --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 --quantize awq

Please also install Flash Attention v1 or v2 and vLLM so that you benefit from FlashAttention and PagedAttention.
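Once the server is up, a quick way to sanity-check latency is to hit TGI's /generate endpoint. Here is a minimal Python sketch; the prompt and generation parameters are just placeholders for a smoke test:

```python
# Minimal client for the TGI server launched above (port 8080 as in the command).
# Prompt and max_new_tokens are placeholders for a quick smoke test.
import time
import requests

payload = {
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 128},
}

start = time.time()
resp = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
resp.raise_for_status()
elapsed = time.time() - start

text = resp.json()["generated_text"]
print(f"{elapsed:.2f}s, {len(text)} characters generated")
print(text)
```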