It is difficult to run LLMs with f32/f16 on a PC. To perform LLM inference on the edge, Q4 quantization is almost a necessity. Perhaps Int4 could be added as a built-in type?
We can't upload int8 or int4 to the GPU yet, but @laggui is working on quantization in Burn. We will probably create abstractions that make it easier to write quantized kernels.
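For context on what Q4 involves: here is a minimal sketch of block-wise 4-bit quantization in Rust. This is an illustration of the general technique (one scale per block of 32 weights, two 4-bit values packed per byte, similar in spirit to GGUF-style Q4 formats), not Burn's actual implementation or API.

```rust
// Hypothetical block-wise Q4 sketch: each block of 32 f32 weights is stored
// as one f32 scale plus 16 packed bytes (two 4-bit values per byte).

fn quantize_q4(block: &[f32; 32]) -> (f32, [u8; 16]) {
    // Scale maps the largest magnitude onto the int4 range [-8, 7].
    let max_abs = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut packed = [0u8; 16];
    for i in 0..16 {
        // Round to a signed 4-bit value, then offset to 0..15 for storage.
        let q = |x: f32| ((x / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;
        packed[i] = q(block[2 * i]) | (q(block[2 * i + 1]) << 4);
    }
    (scale, packed)
}

fn dequantize_q4(scale: f32, packed: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for i in 0..16 {
        out[2 * i] = ((packed[i] & 0x0F) as i8 - 8) as f32 * scale;
        out[2 * i + 1] = ((packed[i] >> 4) as i8 - 8) as f32 * scale;
    }
    out
}

fn main() {
    let weights: [f32; 32] = std::array::from_fn(|i| (i as f32 - 16.0) * 0.1);
    let (scale, packed) = quantize_q4(&weights);
    let restored = dequantize_q4(scale, &packed);
    // Round-trip error is bounded by half a quantization step.
    for (w, r) in weights.iter().zip(restored.iter()) {
        assert!((w - r).abs() <= scale * 0.5 + 1e-6);
    }
    println!("scale = {scale}");
}
```

At ~4.5 bits per weight (16 bytes of data plus a 4-byte scale per 32 weights), this cuts memory roughly 7x versus f32, which is why Q4 is attractive for edge inference even before any int4 GPU support lands.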