intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library
Apache License 2.0

Add support for BitNet b1.58 quantization #3

Open vegax87 opened 7 months ago

vegax87 commented 7 months ago

Is your feature request related to a problem? Please describe. Currently, 8-bit and 4-bit are the de facto standard quantization algorithms, but I would like an implementation of the BitNet b1.58 algorithm, which improves training and inference speed while maintaining accuracy comparable to FP16 by rounding every weight to a ternary value (-1, 0, +1).

Describe the solution you'd like Add BitNet b1.58 quantization to the library. A sketch of the weight quantization step is shown below.

Describe alternatives you've considered There are no alternatives as far as I know; it is a novel quantization algorithm.

Additional context Original paper: https://arxiv.org/pdf/2402.17764.pdf
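For reference, a minimal sketch of the absmean ternary weight quantization described in the paper, written in PyTorch. The function names (`quantize_weights_ternary`, `dequantize`) and the epsilon value are illustrative assumptions, not part of the Intel NPU Acceleration Library API:

```python
# Minimal sketch of BitNet b1.58 weight quantization (absmean scaling, per the paper).
# Names and defaults here are illustrative, not the library's API.
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Round weights to {-1, 0, +1} using the absmean scale gamma = mean(|W|)."""
    scale = w.abs().mean().clamp(min=eps)      # gamma
    w_q = (w / scale).round().clamp(-1, 1)     # RoundClip(W / gamma, -1, 1)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor):
    """Recover a floating-point approximation for accuracy checks."""
    return w_q * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    w_q, scale = quantize_weights_ternary(w)
    print(w_q.unique())                            # tensor([-1., 0., 1.])
    err = (dequantize(w_q, scale) - w).abs().mean()
    print(f"mean abs error: {err:.4f}")
```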

alessandropalla commented 7 months ago

I agree. I expect this to have a big impact, as LLM generation is bandwidth bound, so a smaller weight size will translate into better performance. This feature requires driver updates to be implemented; I'll update this ticket once a compatible driver is available.

mushuanli commented 6 months ago

I tried running Mistral on the NPU (Core Ultra 155H) and compared it with the same model in Ollama; the Ollama version performed better than the NPU version. I think this is because the quantized model is smaller, so it can be read from memory faster. Supporting quantization would be a better choice.

alessandropalla commented 6 months ago

I agree, quantization support is really important for performance, mostly because decoding is DRAM-bandwidth bound, so smaller weights => less data transfer => better performance (https://intel.github.io/intel-npu-acceleration-library/llm_performance.html). We are currently doing driver work to properly support mixed-precision inference on the NPU; it should come in the next driver releases. Stay tuned ;)
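To make the bandwidth argument concrete, here is a rough back-of-the-envelope estimate of per-token weight traffic at different weight widths. The parameter count and the bandwidth figure are placeholder assumptions, not measured NPU numbers:

```python
# Illustrative lower-bound estimate: during decoding, every weight is read once
# per generated token, so weight bytes / bandwidth bounds the per-token latency.
PARAMS = 7e9            # assumed 7B-parameter model
DRAM_BW = 100e9         # assumed sustained bandwidth in bytes/s (hypothetical)

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("b1.58 ternary", 1.58)]:
    bytes_per_token = PARAMS * bits / 8
    latency_ms = bytes_per_token / DRAM_BW * 1e3
    print(f"{name:>14}: {bytes_per_token / 1e9:5.2f} GB/token ≈ {latency_ms:6.1f} ms lower bound")
```

Halving the weight width roughly halves the minimum time to stream the weights, which is why smaller formats help even before any compute savings.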

vegax87 commented 6 months ago

Microsoft has published an updated paper with a basic implementation of BitNet b1.58 in PyTorch:

https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
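In the spirit of that training-tips code, a hedged sketch of a BitLinear-style layer: 8-bit absmax activation quantization plus absmean ternary weights, with a straight-through estimator so the layer stays trainable. The class and variable names are illustrative, not taken from the linked code or from this library:

```python
# Sketch of a BitLinear-style layer (assumed structure, not the official implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    def forward(self, x):
        eps = 1e-5
        # per-token 8-bit absmax quantization of activations
        x_scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
        x_q = (x * x_scale).round().clamp(-128, 127) / x_scale
        # absmean ternary quantization of weights
        w_scale = self.weight.abs().mean().clamp(min=eps)
        w_q = (self.weight / w_scale).round().clamp(-1, 1) * w_scale
        # straight-through estimator keeps gradients flowing to the FP weights
        x_q = x + (x_q - x).detach()
        w_q = self.weight + (w_q - self.weight).detach()
        return F.linear(x_q, w_q, self.bias)

if __name__ == "__main__":
    layer = BitLinear(64, 64)
    y = layer(torch.randn(2, 64))
    print(y.shape)  # torch.Size([2, 64])
```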

UPDATE: There's another very interesting article that combines 1-bit/2-bit quantization with Half-Quadratic Quantization (HQQ):

https://mobiusml.github.io/1bit_blog/