vegax87 opened this issue 9 months ago
I agree. I expect this to have a big impact: LLM generation is bandwidth bound, so smaller weights will translate into better performance. This feature requires driver updates to be implemented; I'll update this ticket once a compatible driver is available.
I tried running Mistral on the NPU (155H) and compared it with running it in ollama; the ollama version performs better than the NPU version. I think this is because the quantized model is smaller, so the weights can be read from memory faster. Supporting quantization would be the better choice.
I agree, quantization support is really important for performance, mostly because decoding is DRAM-bandwidth bound, so smaller weights => less data transfer => better performance (https://intel.github.io/intel-npu-acceleration-library/llm_performance.html). We are currently doing driver work to properly support mixed-precision inference on the NPU; it should come in the next driver releases. Stay tuned ;)
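To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. The 7B parameter count and the 100 GB/s DRAM bandwidth are illustrative assumptions, not measured NPU figures; the point is only that during decoding every weight is streamed roughly once per generated token, so tokens/s is bounded by bandwidth divided by weight footprint.

```python
# Rough upper bound on decode throughput for a memory-bandwidth-bound LLM.
# All numbers are illustrative assumptions, not measured NPU figures.

PARAMS = 7e9          # assumed 7B-parameter model
DRAM_BW_GBS = 100.0   # assumed usable DRAM bandwidth in GB/s

BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4, "b1.58 (ternary)": 1.58}

for name, bits in BITS_PER_WEIGHT.items():
    weight_gb = PARAMS * bits / 8 / 1e9   # weight footprint in GB
    max_tok_s = DRAM_BW_GBS / weight_gb   # each token streams all weights once
    print(f"{name:>15}: {weight_gb:6.2f} GB of weights -> <= {max_tok_s:6.1f} tok/s")
```

Under these assumptions, going from FP16 to ~1.58 bits per weight raises the bandwidth-bound ceiling by roughly 10x, which is why ternary weights are so attractive for decode.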
Microsoft has published an updated paper with a basic implementation of BitNet b1.58 in PyTorch:
UPDATE: There's another very interesting article that combines 1-bit/2-bit with Half-Quadratic Quantization:
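For anyone who wants to experiment before driver support lands, below is a minimal, self-contained sketch of the BitLinear-style layer the b1.58 papers describe (absmean ternary weights plus per-token 8-bit activations). This is not the Microsoft reference code, just an illustration written for this thread: it fakes quantization in floating point (quantize-dequantize with a straight-through estimator) and omits the RMSNorm the paper applies before activation quantization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # absmean scaling, then round-and-clip every weight to ternary {-1, 0, +1}
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale


def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # per-token absmax scaling into the signed 8-bit range [-128, 127]
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale


class BitLinear(nn.Linear):
    """Drop-in nn.Linear that simulates b1.58 quantization (quantize-dequantize)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # straight-through estimator: the forward pass sees quantized values,
        # gradients flow through the full-precision tensors
        x_q = x + (activation_quant(x) - x).detach()
        w_q = w + (weight_quant(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)


# usage: swap nn.Linear layers for BitLinear and train / fine-tune as usual
layer = BitLinear(4096, 4096, bias=False)
out = layer(torch.randn(1, 16, 4096))
print(out.shape)  # torch.Size([1, 16, 4096])
```

The real speed and memory win of course only comes once the ternary weights are stored packed (e.g. 2 bits each) and the matmul kernel consumes them directly, which is exactly the part that needs NPU driver/compiler support.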
Is your feature request related to a problem? Please describe.
Currently 8-bit and 4-bit are the de facto standard quantization algorithms, but I would like to have an implementation of the BitNet b1.58 algorithm, which improves training speed and inference speed while matching the accuracy of FP16 models by rounding every weight to the ternary values (-1, 0, +1) (see the toy example at the end of this issue).
Describe the solution you'd like
Add BitNet b1.58 quantization to the library.
Describe alternatives you've considered
There are no alternatives as far as I know; it's a novel quantization algorithm.
Additional context
Original paper: https://arxiv.org/pdf/2402.17764.pdf
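For concreteness, here is a toy numeric example of the absmean ternary rounding described above (the matrix values are arbitrary, purely for illustration):

```python
import torch

w = torch.tensor([[0.30, -0.80,  0.05, 1.20],
                  [0.10,  0.60, -0.40, 0.02]])

scale = w.abs().mean()                        # absmean scale of the matrix
w_ternary = (w / scale).round().clamp(-1, 1)  # every weight becomes -1, 0 or +1
print(w_ternary)
# tensor([[ 1., -1.,  0.,  1.],
#         [ 0.,  1., -1.,  0.]])
```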