Open okpatil4u opened 1 year ago
We are currently working on a GPU backend based on wgpu; quantization is on our roadmap.
Related to this, I would also love to see support for i8, i16, i32, and i64 quantization for inference, i.e. to run models on embedded MCUs without a dedicated FPU (e.g. esp32c3, esp32c6) using the no_std NdArray backend.
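For illustration only, here is a minimal no_std-friendly sketch of the kind of integer-only arithmetic such a backend would rely on: an i8 dot product with an i32 accumulator, followed by fixed-point requantization, so no FPU is needed. The function names and the fixed-point scheme are hypothetical and not part of Burn or the NdArray backend.

```rust
#![no_std] // crate-level attribute; this sketch uses only core, no floats

// Hypothetical integer-only kernels for FPU-less MCUs; not Burn's API.

/// Dot product of two i8 slices with an i32 accumulator.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x as i32 * y as i32)
        .sum()
}

/// Requantize an i32 accumulator back to i8 using a fixed-point multiplier
/// (roughly real_scale * 2^31), assuming symmetric quantization (zero-point 0).
fn requantize(acc: i32, multiplier: i32, shift: u32) -> i8 {
    let v = (acc as i64 * multiplier as i64) >> (31 + shift);
    v.clamp(i8::MIN as i64, i8::MAX as i64) as i8
}
```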
Feature description
Llama.cpp has gained traction because it can run inference on models at 2-, 3-, 4-, 5-, 6-, 8-, 16-, and 32-bit precision. Would it be possible to add inference-level quantisation capabilities to Burn?
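To make the request concrete, here is a rough sketch of symmetric per-tensor 8-bit weight quantization and dequantization in plain Rust. It only illustrates the idea of trading precision for size and speed; it is not Burn's API, and it is much simpler than llama.cpp's actual block formats (Q4_0, Q8_0, etc.).

```rust
// Illustrative sketch: store f32 weights as i8 plus one f32 scale.
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    // Choose the scale so the largest magnitude maps to 127.
    let max_abs = weights.iter().fold(0.0_f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.12_f32, -0.5, 0.33, 0.9];
    let (q, scale) = quantize_i8(&w);
    println!("quantized = {q:?}, scale = {scale}, roundtrip = {:?}", dequantize_i8(&q, scale));
}
```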
Feature motivation
Faster inference enables deployment at the edge (e.g. in the browser or on laptops).