Open AndreaChiChengdu opened 2 months ago
Both the inference kernel and the convert script already support mixed-precision quantization by detecting the bit width of each layer. However, I'm not aware of any tool that generates a mixed-precision GPTQ model. If such a model exists, T-MAC can support it.
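For reference, here is a minimal sketch of how per-layer bit widths can be inferred during conversion, assuming the common AutoGPTQ packing convention (int32-packed `qweight`, per-layer `g_idx`). The helper name `detect_layer_bits` is illustrative, not T-MAC's actual convert-script API:

```python
import torch

def detect_layer_bits(state_dict):
    """Infer per-layer bit width from packed GPTQ tensors (AutoGPTQ layout)."""
    bits = {}
    for name, qweight in state_dict.items():
        if not name.endswith(".qweight"):
            continue
        layer = name[: -len(".qweight")]
        # AutoGPTQ packs qweight as int32 with shape
        # (in_features // 32 * bits, out_features), and g_idx has
        # shape (in_features,), so bits falls out of the ratio.
        in_features = state_dict[layer + ".g_idx"].shape[0]
        bits[layer] = qweight.shape[0] * 32 // in_features
    return bits

# Usage: each layer can then be routed to a kernel tuned for its bit width.
# state_dict = torch.load("gptq-model.bin", map_location="cpu")
# print(detect_layer_bits(state_dict))  # e.g. {"model.layers.0.self_attn.q_proj": 2, ...}
```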
I have seen that T-MAC tunes kernels per shape, bit width, and so on, but a compiled llama.cpp kernel only supports one bit width and one network. How can a mixed network be supported? (For example, a tuned and compiled llama.cpp can run a 2-bit BitNet, but running other bit widths or networks, such as a 4-bit BitNet or a 2-bit Llama-2, produces an error.)
For example, a model whose weights mix the I2, I3, and I4 quantization types. I checked the documentation and scripts, and it seems this is not supported yet? Thanks!
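To illustrate the limitation I'm hitting, here is a rough sketch of how I understand per-configuration kernel dispatch, assuming a hypothetical lookup table keyed on (bits, shape). All names here are illustrative and not T-MAC's actual dispatch mechanism:

```python
# Kernels are tuned and compiled per (bits, out_features, in_features),
# so a binary tuned for 2-bit BitNet shapes has no entry for a 4-bit
# model or for a different architecture's shapes.
COMPILED_KERNELS = {
    (2, 2048, 2048): "qgemm_lut_2bit_2048x2048",  # hypothetical symbol names
    (2, 5504, 2048): "qgemm_lut_2bit_5504x2048",
}

def lookup_kernel(bits: int, m: int, k: int) -> str:
    key = (bits, m, k)
    if key not in COMPILED_KERNELS:
        # This mirrors the error described above: a 4-bit BitNet or a
        # 2-bit Llama-2 finds no kernel in a binary tuned for 2-bit BitNet.
        raise KeyError(
            f"no kernel compiled for bits={bits}, shape=({m}, {k}); "
            "re-run tuning/compilation for this configuration"
        )
    return COMPILED_KERNELS[key]
```

If this understanding is right, a mixed-precision model would need the tuning step to emit kernels for every (bits, shape) pair that appears in the network.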