Closed: petrasS3 closed this issue 6 months ago.
Might be worth mentioning the $500 reward in this issue: #392
@petrasS3 There is a fast implementation in merge request https://github.com/vllm-project/vllm/pull/762. Would this qualify? I'm just working on some cosmetic code-style improvements, then it will be mergeable.
omg this is amazing, did you test out TheBloke's models, especially the GPTQ ones?
I tried implementing it with GPTQ, but AWQ quantization was faster and gave better inference quality.
Here is a link to an example of quantizing with AWQ, it's very straightforward: https://github.com/mit-han-lab/llm-awq/blob/main/scripts/llama_example.sh
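For a rough picture of what that flow looks like in Python, here is a minimal sketch of 4-bit AWQ quantization using the AutoAWQ wrapper rather than the llm-awq shell script itself; the model name, output path, and config values are just examples and assume a recent AutoAWQ release:

```python
# Rough sketch of 4-bit AWQ quantization via the AutoAWQ wrapper
# (the linked llama_example.sh drives the mit-han-lab/llm-awq CLI instead).
# Model name, output path, and config values are illustrative.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # example fp16 source model
quant_path = "llama-2-7b-awq"             # example output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run the activation-aware quantization pass, then save the 4-bit checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```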
Does this mean GPTQ models cannot be loaded with this? Yes, AWQ is faster, but there are not that many models for it.
Just having "load in 8-bit" support alone would be fine as a first step. I would like to run Llama 2 13B and WizardCoder 15B (StarCoder architecture) on a 24GB GPU.
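For context, the kind of "load in 8-bit" behaviour being asked for here already exists outside vLLM via bitsandbytes in transformers, which roughly halves weight memory versus fp16; a minimal sketch (the model name is just an example):

```python
# Sketch: 8-bit weight loading with bitsandbytes through transformers (not vLLM).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # example model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to int8 at load time (bitsandbytes)
    device_map="auto",   # spread layers across the available GPU(s)
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```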
Did you try it with GPTQ?
Aha, I recommend our project LMDeploy, which supports 4-bit weight and 8-bit k/v cache quantization and inference. Check out the generation throughput (output tokens/s) on a GeForce RTX 4090 below. Totally worth $500 :smile:
model | llm-awq | mlc-llm | lmdeploy |
---|---|---|---|
Llama-2-7B-chat | 112.9 | 159.4 | 206.4 |
Llama-2-13B-chat | N/A | 90.7 | 115.8 |
Memory (GB) comparison between the 4-bit and 16-bit models, with context sizes of 2048 and 4096 respectively:
model | 16bit(2048) | 4bit(2048) | 16bit(4096) | 4bit(4096) |
---|---|---|---|---|
Llama-2-7B-chat | 15.1 | 6.3 | 16.2 | 7.5 |
Llama-2-13B-chat | OOM | 10.3 | OOM | 12.0 |
> Does this mean GPTQ models cannot be loaded with this? Yes, AWQ is faster, but there are not that many models for it.
I will be starting to upload AWQ models to HF soon, hopefully in the next few days.
Working with AWQ models will also get easier soon enough when #72 is merged in AWQ. The plan is to make it easier to use so we can improve it.
> Aha, I recommend our project LMDeploy, which supports 4-bit weight and 8-bit k/v cache quantization and inference.
Where can I find the docs?
> I will be starting to upload AWQ models to HF soon, hopefully in the next few days.
Will you share the code for quantizing AWQ?
> Will you share the code for quantizing AWQ?
I'm expecting to just use the code from the main AWQ repo - it's documented here: https://github.com/mit-han-lab/llm-awq#usage
Looks easy enough!
> Where can I find the docs?
https://github.com/InternLM/lmdeploy/blob/main/docs/en/w4a16.md
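For orientation only: newer LMDeploy releases also expose a Python pipeline API for running AWQ-format 4-bit weights, which may be simpler than the CLI flow described in that doc; a minimal sketch, assuming a recent lmdeploy install and an already-quantized 4-bit checkpoint (the model name is illustrative):

```python
# Sketch: serving a 4-bit (AWQ-format) model with LMDeploy's Python API.
# Assumes a recent lmdeploy release; the linked w4a16.md describes the
# original CLI-based workflow from the time of this thread.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(model_format="awq")  # tell TurboMind the weights are AWQ 4-bit
pipe = pipeline("internlm/internlm2-chat-7b-4bit", backend_config=engine_cfg)  # example checkpoint

responses = pipe(["What does 4-bit weight quantization buy you?"])
print(responses[0].text)
```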
Does it support the Docker container? It keeps giving me this error:
[ 96%] Built target triton-common-logging
[ 97%] Built target triton-common-thread-pool
[ 98%] Built target triton-common-table-printer
[ 99%] Linking CXX executable ../../../bin/llama_triton_example
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o: in function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)':
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/op_inln.h:121: undefined reference to `ompi_mpi_cxx_op_intercept'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o: in function `MPI::Intracomm::Intracomm(ompi_communicator_t*)':
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:23: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o: in function `MPI::Intracomm::Intracomm()':
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o:/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: more undefined references to `MPI::Comm::Comm()' follow
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o:(.data.rel.ro._ZTVN3MPI8DatatypeE[_ZTVN3MPI8DatatypeE]+0x78): undefined reference to `MPI::Datatype::Free()'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o:(.data.rel.ro._ZTVN3MPI3WinE[_ZTVN3MPI3WinE]+0x48): undefined reference to `MPI::Win::Free()'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/cpp/llama/CMakeFiles/llama_triton_example.dir/build.make:132: bin/llama_triton_example] Error 1
make[1]: *** [CMakeFiles/Makefile2:3040: examples/cpp/llama/CMakeFiles/llama_triton_example.dir/all] Error 2
make: *** [Makefile:156: all] Error 2
It supports docker. Can you open an issue in lmdeploy and tell us how to reproduce it?
> Check out the performance on a GeForce RTX 4090. Totally worth $500 :smile:
Are these numbers measured at batch size 1? Also, how to reproduce these numbers?
> Are these numbers measured at batch size 1? Also, how to reproduce these numbers?
Yes, batch size is 1. You can use benchmark/profile_generation.py to reproduce it.
What's the progress here? Do we use LMDeploy or will vllm have support for this?
It looks like #762 is close to being merged. I was sad to see "currently does not support tensor parallelism" but there is already a fork being worked on for that.
I only just learned of skypilot and started using it today. I think once that fork becomes a PR, AWQ quantized models can be deployed with skypilot as well?
LMDeploy (>=v0.0.6) supports tensor parallelism for 4-bit quantized LLM model inference
What's the plan to support more models like MPT or Falcon?
The only models I use are either Llama2-GPTQ or Flan-T5, so I can't use vLLM whatsoever.
What do you mean by you can't use it?
vLLM doesn't support GPTQ models or T5 models, from what I can tell.
You should use AWQ models either way as they are better.
823 is greater than 47 (there are far more GPTQ models available than AWQ models).
I already talked with TheBloke, more models will come for AWQ.
Does anyone know if vLLM supports quantised models yet? Reading through these comments, it looks like AWQ might be supported?
edit: apparently these guys have pulled it off here: https://github.com/vllm-project/vllm/pull/762
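For reference, once the AWQ support from #762 is in a release, loading an AWQ checkpoint from the Hub is done by passing quantization="awq"; a minimal sketch (the model name is just an example):

```python
# Sketch: running an AWQ-quantized model with vLLM (requires a build that
# includes the AWQ support from PR #762; model name is an example).
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["What is activation-aware weight quantization?"], params)
print(outputs[0].outputs[0].text)
```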
Is it possible to directly use the AWQ models from the HF Hub now?
Does AWQ support higher than 4 bits per weight? For example 8 bits like GGUF?
AWQ supports 4 bits per weight. Probably 8 bits will come at a later time.
I tried a custom AWQ model with the latest vllm==0.2.0. It loads the quantized model, but there is no performance gain; I guess that's because it's dequantized during the load. https://github.com/vllm-project/vllm/commit/2b1c116b5acdf3b738e310f98617875132214c37#diff-074bf1408c395d4d7dfaa07aaafaf6ebebb2b3bde50cdbbb868cc95d23daffb5
Closing as this is now resolved.
so who won the $500?
We need to add support for quantized models in the vLLM project. We need this to run a quantized Llama model via vLLM. This involves implementing quantization techniques to optimize memory usage and runtime performance. A reward of $500 will be granted to the contributor who successfully completes this task.