Open okpatil4u opened 1 year ago
We are currently working on a GPU backend based on wgpu; quantization is on our roadmap.
Related to this, I would also love to see support for i8, i16, i32, and i64 quantization for inference, i.e. to run models on embedded MCUs without a dedicated FPU (e.g. esp32c3, esp32c6) using the no_std NdArray backend.
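For illustration only, here is a minimal no_std-friendly sketch of the kind of integer-only arithmetic such a backend would rely on: an i8 dot product with an i32 accumulator, followed by fixed-point requantization, so no FPU is needed. The function names and the fixed-point scheme are hypothetical and not part of Burn or the NdArray backend.

```rust
#![no_std] // crate-level attribute; this sketch uses only core, no floats

// Hypothetical integer-only kernels for FPU-less MCUs; not Burn's API.

/// Dot product of two i8 slices with an i32 accumulator.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x as i32 * y as i32)
        .sum()
}

/// Requantize an i32 accumulator back to i8 using a fixed-point multiplier
/// (roughly real_scale * 2^31), assuming symmetric quantization (zero-point 0).
fn requantize(acc: i32, multiplier: i32, shift: u32) -> i8 {
    let v = (acc as i64 * multiplier as i64) >> (31 + shift);
    v.clamp(i8::MIN as i64, i8::MAX as i64) as i8
}
```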
Feature description
Llama.cpp has gained traction because it can run inference on models at 2-, 3-, 4-, 5-, 6-, 8-, 16-, and 32-bit precision. Would it be possible to add inference-level quantisation capabilities to Burn?
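To make the request concrete, here is a rough sketch of symmetric per-tensor 8-bit weight quantization and dequantization in plain Rust. It only illustrates the idea of trading precision for size and speed; it is not Burn's API, and it is much simpler than llama.cpp's actual block formats (Q4_0, Q8_0, etc.).

```rust
// Illustrative sketch: store f32 weights as i8 plus one f32 scale.
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    // Choose the scale so the largest magnitude maps to 127.
    let max_abs = weights.iter().fold(0.0_f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.12_f32, -0.5, 0.33, 0.9];
    let (q, scale) = quantize_i8(&w);
    println!("quantized = {q:?}, scale = {scale}, roundtrip = {:?}", dequantize_i8(&q, scale));
}
```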
Feature motivation
Faster inference enables deployment at the edge (e.g. in the browser or on laptops).