Open msaroufim opened 1 month ago
@felipemello1 shared this PR with me https://github.com/vllm-project/vllm/pull/7415
My sense is we should already be able to support this with
```python
from torchao.quantization.quant_api import quantize_, int8_weight_only
quantize_(m, int8_weight_only())
```
Would just need a good example to showcase this
Also cc @jcaip and @cpuhrsch who've thought a lot more about MoE than me
for context, it was used in the new jamba model: https://x.com/yampeleg/status/1826617129363239143?s=46