Closed: petrasS3 closed this issue 6 months ago.
Might be worth mentioning the $500 reward in this issue: #392
@petrasS3 There is a fast implementation in merge request https://github.com/vllm-project/vllm/pull/762. Would this qualify? I'm just working on some cosmetic code-style improvements, then it will be mergeable.
omg this is amazing, did you test out TheBloke's models, especially the GPTQ ones?
I tried implementing it with GPTQ, but AWQ quantization was faster and gave better inference quality.
Here is a link to an example of quantizing with AWQ, it's very straightforward: https://github.com/mit-han-lab/llm-awq/blob/main/scripts/llama_example.sh
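For a rough picture of what that flow looks like in Python, here is a minimal sketch of 4-bit AWQ quantization using the AutoAWQ wrapper rather than the llm-awq shell script itself; the model name, output path, and config values are just examples and assume a recent AutoAWQ release:

```python
# Rough sketch of 4-bit AWQ quantization via the AutoAWQ wrapper
# (the linked llama_example.sh drives the mit-han-lab/llm-awq CLI instead).
# Model name, output path, and config values are illustrative.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # example fp16 source model
quant_path = "llama-2-7b-awq"             # example output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run the activation-aware quantization pass, then save the 4-bit checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```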
Does this mean GPTQ models cannot be loaded with this? Yes, AWQ is faster, but there are not that many models for it.
Just having "load in 8-bit" support alone would be fine as a first step. I would like to run Llama 2 13B and WizardCoder 15B (StarCoder architecture) on a 24GB GPU.
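For context, the kind of "load in 8-bit" behaviour being asked for here already exists outside vLLM via bitsandbytes in transformers, which roughly halves weight memory versus fp16; a minimal sketch (the model name is just an example):

```python
# Sketch: 8-bit weight loading with bitsandbytes through transformers (not vLLM).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # example model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to int8 at load time (bitsandbytes)
    device_map="auto",   # spread layers across the available GPU(s)
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```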
Did you try it with GPTQ?
Aha, I recommend our project LMDeploy, which supports 4-bit weight and 8-bit k/v cache quantization and inference. Check out the generation throughput (output tokens/s) on a GeForce RTX 4090 below. Totally worth $500 :smile:
model | llm-awq | mlc-llm | lmdeploy |
---|---|---|---|
Llama-2-7B-chat | 112.9 | 159.4 | 206.4 |
Llama-2-13B-chat | N/A | 90.7 | 115.8 |
Memory (GB) comparison between the 4-bit and 16-bit models, with context sizes of 2048 and 4096 respectively:
model | 16bit(2048) | 4bit(2048) | 16bit(4096) | 4bit(4096) |
---|---|---|---|---|
Llama-2-7B-chat | 15.1 | 6.3 | 16.2 | 7.5 |
Llama-2-13B-chat | OOM | 10.3 | OOM | 12.0 |
> Does this mean GPTQ models cannot be loaded with this? Yes, AWQ is faster, but there are not that many models for it.
I will be starting to upload AWQ models to HF soon, hopefully in the next few days.
Working with AWQ models will also get easier soon enough when #72 is merged in AWQ. The plan is to make it easier to use so we can improve it.
> Aha, I recommend our project LMDeploy, which supports 4-bit weight and 8-bit k/v cache quantization and inference.
Where can I find the docs?
> I will be starting to upload AWQ models to HF soon, hopefully in the next few days.
Will you share the code for quantizing AWQ?
> Will you share the code for quantizing AWQ?
I'm expecting to just use the code from the main AWQ repo - it's documented here: https://github.com/mit-han-lab/llm-awq#usage
Looks easy enough!
> Where can I find the docs?
https://github.com/InternLM/lmdeploy/blob/main/docs/en/w4a16.md
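For orientation only: newer LMDeploy releases also expose a Python pipeline API for running AWQ-format 4-bit weights, which may be simpler than the CLI flow described in that doc; a minimal sketch, assuming a recent lmdeploy install and an already-quantized 4-bit checkpoint (the model name is illustrative):

```python
# Sketch: serving a 4-bit (AWQ-format) model with LMDeploy's Python API.
# Assumes a recent lmdeploy release; the linked w4a16.md describes the
# original CLI-based workflow from the time of this thread.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(model_format="awq")  # tell TurboMind the weights are AWQ 4-bit
pipe = pipeline("internlm/internlm2-chat-7b-4bit", backend_config=engine_cfg)  # example checkpoint

responses = pipe(["What does 4-bit weight quantization buy you?"])
print(responses[0].text)
```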
Does it support the Docker container? It keeps giving me this error:
[ 96%] Built target triton-common-logging
[ 97%] Built target triton-common-thread-pool
[ 98%] Built target triton-common-table-printer
[ 99%] Linking CXX executable ../../../bin/llama_triton_example
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o: in function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)':
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/op_inln.h:121: undefined reference to `ompi_mpi_cxx_op_intercept'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o: in function `MPI::Intracomm::Intracomm(ompi_communicator_t*)':
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:23: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o: in function `MPI::Intracomm::Intracomm()':
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o:/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/intracomm.h:25: more undefined references to `MPI::Comm::Comm()' follow
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o:(.data.rel.ro._ZTVN3MPI8DatatypeE[_ZTVN3MPI8DatatypeE]+0x78): undefined reference to `MPI::Datatype::Free()'
/usr/bin/ld: CMakeFiles/llama_triton_example.dir/llama_triton_example.cc.o:(.data.rel.ro._ZTVN3MPI3WinE[_ZTVN3MPI3WinE]+0x48): undefined reference to `MPI::Win::Free()'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/cpp/llama/CMakeFiles/llama_triton_example.dir/build.make:132: bin/llama_triton_example] Error 1
make[1]: *** [CMakeFiles/Makefile2:3040: examples/cpp/llama/CMakeFiles/llama_triton_example.dir/all] Error 2
make: *** [Makefile:156: all] Error 2
It supports docker. Can you open an issue in lmdeploy and tell us how to reproduce it?
> Check out the performance on a GeForce RTX 4090. Totally worth $500 :smile:
Are these numbers measured at batch size 1? Also, how to reproduce these numbers?
> Are these numbers measured at batch size 1? Also, how to reproduce these numbers?
Yes, batch size is 1. You can use benchmark/profile_generation.py to reproduce it.
What's the progress here? Do we use LMDeploy or will vllm have support for this?
It looks like #762 is close to being merged. I was sad to see "currently does not support tensor parallelism" but there is already a fork being worked on for that.
I only just learned of skypilot and started using it today. I think once that fork becomes a PR, AWQ quantized models can be deployed with skypilot as well?
LMDeploy (>=v0.0.6) supports tensor parallelism for 4-bit quantized LLM model inference
What's the plan to support more models like MPT or Falcon?
The only models I use are either Llama2-GPTQ or Flan-T5, so I can't use vLLM whatsoever.
What do you mean by you can't use it?
vLLM doesn't support GPTQ models or T5 models, from what I can tell.
You should use AWQ models either way as they are better.
823 is greater than 47 (there are far more GPTQ models available than AWQ models).
I already talked with TheBloke, more models will come for AWQ.
Does anyone know if vLLM supports quantised models yet? Reading through these comments, it looks like AWQ might be supported?
edit: apparently these guys have pulled it off here: https://github.com/vllm-project/vllm/pull/762
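For reference, once the AWQ support from #762 is in a release, loading an AWQ checkpoint from the Hub is done by passing quantization="awq"; a minimal sketch (the model name is just an example):

```python
# Sketch: running an AWQ-quantized model with vLLM (requires a build that
# includes the AWQ support from PR #762; model name is an example).
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["What is activation-aware weight quantization?"], params)
print(outputs[0].outputs[0].text)
```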
Is it possible to directly use the AWQ models from the HF Hub now?
Does AWQ support higher than 4 bits per weight? For example 8 bits like GGUF?
AWQ supports 4 bits per weight. Probably 8 bits will come at a later time.
I tried a custom AWQ model with the latest vllm==0.2.0. It loads the quantized model, but there is no performance gain; I guess that's because it's dequantized during the load. https://github.com/vllm-project/vllm/commit/2b1c116b5acdf3b738e310f98617875132214c37#diff-074bf1408c395d4d7dfaa07aaafaf6ebebb2b3bde50cdbbb868cc95d23daffb5
Closing as this is now resolved.
so who won the $500?
We need to add support for quantized models in the vLLM project. We need this to run a quantized Llama model via vLLM. This involves implementing quantization techniques to optimize memory usage and runtime performance. A reward of $500 will be granted to the contributor who successfully completes this task.