huggingface / ratchet

A cross-platform browser ML framework.
https://ratchet.sh
MIT License

4-bit quantization support #260

Open bil-ash opened 4 weeks ago

bil-ash commented 4 weeks ago

I would like to use this library for in-browser ML inference because, with the upcoming CPU support, it is better than:

  1. ggml (llama.cpp/whisper.cpp) - Ratchet supports both CPU and GPU, and can use the GPU on devices where WebGPU is available, giving better performance
  2. web-llm (which is WebGPU-only) - Ratchet will have a CPU backend, allowing inference on devices where WebGPU is not supported (many Android browsers)
  3. ONNX Runtime Web - Ratchet is lighter than ONNX Runtime

However, all three of them support 4-bit quantization, whereas (apparently) Ratchet only supports 8-bit quantization. 4-bit quantization is essential: without it, models such as whisper-v3-turbo and llama-3.2-1b cannot run in the browser on devices with limited RAM. So, please support 4-bit quantization soon.
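For context, here is a minimal sketch of what block-wise 4-bit quantization typically looks like, loosely following llama.cpp's Q4_0 layout. This is not Ratchet's actual API; the names and block layout are illustrative only. Each block of 32 weights shares one f32 scale and packs two 4-bit values per byte, roughly halving weight memory versus 8-bit (a 1B-parameter model drops from ~1 GB of int8 weights to ~0.5 GB, plus per-block scales):

```rust
// Sketch of block-wise 4-bit quantization in the style of llama.cpp's
// Q4_0 format. NOT Ratchet's API; names and layout are illustrative.
//
// Rough weight-memory arithmetic for a 1B-parameter model:
//   f32:  1e9 * 4 bytes   ~= 4.0 GB
//   int8: 1e9 * 1 byte    ~= 1.0 GB
//   int4: 1e9 * 0.5 bytes ~= 0.5 GB (plus one scale per block)

const BLOCK: usize = 32;

/// One quantized block: a shared scale plus 16 bytes of packed nibbles.
struct BlockQ4 {
    scale: f32,
    packed: [u8; BLOCK / 2],
}

fn quantize_block(weights: &[f32; BLOCK]) -> BlockQ4 {
    // Symmetric quantization: map [-max, max] onto the signed range [-8, 7].
    let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max > 0.0 { max / 7.0 } else { 1.0 };

    // Quantize one weight, clamp, then bias by 8 to store as an unsigned nibble.
    let q = |w: f32| ((w / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;

    let mut packed = [0u8; BLOCK / 2];
    for i in 0..BLOCK / 2 {
        packed[i] = q(weights[2 * i]) | (q(weights[2 * i + 1]) << 4);
    }
    BlockQ4 { scale, packed }
}

fn dequantize_block(b: &BlockQ4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for i in 0..BLOCK / 2 {
        // Undo the +8 bias, then rescale.
        let lo = (b.packed[i] & 0x0F) as i32 - 8;
        let hi = (b.packed[i] >> 4) as i32 - 8;
        out[2 * i] = lo as f32 * b.scale;
        out[2 * i + 1] = hi as f32 * b.scale;
    }
    out
}
```

The per-block scale is the usual trade-off in these formats: smaller blocks track outlier weights more accurately, at the cost of slightly more storage for the scales.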

FL33TW00D commented 4 weeks ago

Hey @bil-ash, thanks for raising these points.

We have done some work on 4-bit quantization here, but it's not complete.

CPU + 4-bit are both very important to us, stay tuned.