neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

Research: 4-bit quantization #1369

Closed truenorth8 closed 4 months ago

truenorth8 commented 10 months ago

Hi.

The paper describes 8-bit quantization combined with pruning, which is fantastic.

My question: has any research been done on 4-bit quantization? Since GPU memory is notoriously expensive, 4-bit quantization would allow running much bigger models (e.g., 70B models that need low latency and therefore have to run on the GPU).
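For a rough sense of the memory argument, here is some back-of-the-envelope arithmetic (illustrative only; real deployments also need memory for activations, the KV cache, and quantization metadata such as scales and zero-points):

```python
# Rough weight-memory footprint of a 70B-parameter model at different precisions.
params = 70e9

for bits, name in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights")

# fp16: ~130 GiB, int8: ~65 GiB, int4: ~33 GiB
```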

I'd be happy to contribute if someone could provide implementation guidance.

anmarques commented 9 months ago

Hi @truenorth8. Thanks for your interest in contributing to DeepSparse. There has actually been plenty of research into quantizing LLMs to 4 bits; some of the best-known examples are GPTQ and its subsequent extension SparseGPT. These algorithms are available in the nightly version of SparseML, our library for sparsifying LLMs. Here's a link to the main entry point: https://github.com/neuralmagic/sparseml/blob/main/src/sparseml/transformers/sparsification/obcq/obcq.py.
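For intuition, here is a minimal sketch of group-wise asymmetric 4-bit weight quantization, which is roughly the representation these algorithms produce. It is illustrative only and is not SparseML's API; GPTQ/SparseGPT additionally use second-order information to choose the quantized values, which this sketch omits.

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Group-wise asymmetric 4-bit quantization of a flat weight tensor.

    Illustrative sketch only: real 4-bit schemes pack two values per byte
    and use smarter rounding (e.g., GPTQ's error compensation).
    """
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-12)  # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_4bit(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(4096 * 128).astype(np.float32)
q, scale, zero = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, zero)
print("max abs error:", np.abs(w.reshape(q.shape) - w_hat).max())
```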

Just to clarify: DeepSparse is our inference engine for fast execution of sparse and quantized models, and at the moment it supports CPUs only. However, you can use SparseML to produce compressed models and deploy them on any platform.

Fritskee commented 7 months ago

@anmarques will 4-bit quantization also come to YOLO models?

mgoin commented 7 months ago

Hi @Fritskee, I don't see a great motivation for <8-bit YOLO models, since going to lower precisions mainly reduces weight memory usage, which is not a bottleneck for YOLO. When optimizing those architectures for performance you want to reduce the size of the large activations (images) or reduce the compute needed to perform the convolutions. This is why for YOLO we apply 8-bit quantization and sparsity to reduce compute; see some models here: https://sparsezoo.neuralmagic.com/?modelSet=computer_vision&tasks=detection&architectures=yolov8
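To make the weights-vs-activations point concrete, here is a back-of-the-envelope comparison for a single convolution layer with YOLO-like shapes (the shapes are illustrative assumptions, not taken from any specific model):

```python
# For a conv layer early in a detection network, activation memory dwarfs
# weight memory, so shrinking weights below 8 bits buys little.
c_in, c_out, k = 64, 128, 3          # channels and kernel size (assumed)
h, w = 320, 320                      # feature-map resolution at this stage (assumed)

weight_bytes_int8 = c_out * c_in * k * k   # 1 byte per weight
act_bytes_int8 = c_out * h * w             # output activation, 1 byte per value

print(f"weights:     {weight_bytes_int8 / 1e6:.2f} MB")   # ~0.07 MB
print(f"activations: {act_bytes_int8 / 1e6:.2f} MB")      # ~13 MB
```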

Fritskee commented 7 months ago

Thanks for the explanation!

jeanniefinks commented 4 months ago

Hello @Fritskee, as there are no further comments here, I am going to go ahead and close out this issue. Feel free to re-open if you would like to continue the conversation. Regards, Jeannie / Neural Magic