pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

MoE example #729

Open msaroufim opened 1 month ago

msaroufim commented 1 month ago

@felipemello1 shared this PR with me https://github.com/vllm-project/vllm/pull/7415

My sense is we should already be able to support this with

from torchao.quantization.quant_api import quantize_, int8_weight_only
quantize_(m, int8_weight_only())

Would just need a good example to showcase this
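A minimal sketch of what such an example might look like: a toy MoE block (hypothetical, not the vLLM/Jamba implementation) whose expert `nn.Linear` layers get swapped for int8 weight-only quantized versions by `quantize_`. The `TinyMoE` module and its top-1 routing are illustrative assumptions; the torchao import is guarded so the snippet also runs without torchao installed.

```python
# Hypothetical minimal MoE block to illustrate applying torchao's
# quantize_ API to expert weights. Routing here is simplified top-1.
import torch
import torch.nn as nn


class TinyMoE(nn.Module):
    def __init__(self, dim=16, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)
        )

    def forward(self, x):
        # Route each token to its highest-scoring expert.
        scores = self.gate(x).softmax(dim=-1)
        top = scores.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


m = TinyMoE()
x = torch.randn(8, 16)

# quantize_ mutates the module in place, replacing Linear weights with
# int8 weight-only quantized tensors (requires torchao to be installed).
try:
    from torchao.quantization.quant_api import quantize_, int8_weight_only
    quantize_(m, int8_weight_only())
except ImportError:
    pass  # torchao not available; fall back to the float model

print(tuple(m(x).shape))
```

Since `quantize_` walks the module tree, the experts inside the `nn.ModuleList` are picked up without any MoE-specific handling, which is why a worked example like this may be all that's needed.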

Also cc @jcaip and @cpuhrsch who've thought a lot more about MoE than me

felipemello1 commented 1 month ago

For context, this was used in the new Jamba model: https://x.com/yampeleg/status/1826617129363239143?s=46