pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
BSD 3-Clause "New" or "Revised" License

Add CPU support in mixtral-moe for int8 woq #145

Closed: yanbing-j closed this 2 months ago

yanbing-j commented 3 months ago

This PR adds CPU support in mixtral-moe for int8 weight-only quantization (woq). To improve int8 woq performance, we use torch.ops.aten._weight_int8pack_mm as a workaround; it can be removed once https://github.com/pytorch/pytorch/pull/120985 lands in a PyTorch stable release. Meanwhile, this PR also updates the int4 weight dimensions, since https://github.com/pytorch/pytorch/pull/117475 has been merged into PyTorch.
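For reviewers skimming the diff, the int8 woq change roughly amounts to the following. This is a minimal sketch rather than the exact patch: the module name `WeightOnlyInt8Linear` and its layout mirror gpt-fast's existing quantized linear for illustration, and it assumes a PyTorch build that already ships `torch.ops.aten._weight_int8pack_mm` (2-D activation `[m, k]`, int8 weight `[n, k]`, per-output-channel scales `[n]`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightOnlyInt8Linear(nn.Module):
    """Int8 weight-only-quantized linear with a fused CPU path (sketch)."""

    def __init__(self, in_features: int, out_features: int, dtype=torch.bfloat16):
        super().__init__()
        # int8 weight [out_features, in_features], one scale per output channel
        self.register_buffer(
            "weight", torch.empty(out_features, in_features, dtype=torch.int8)
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=dtype))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.device.type == "cpu":
            # Workaround path: fused int8-weight matmul. The op expects a
            # 2-D activation, so flatten leading dims and restore them after.
            x2d = x.reshape(-1, x.shape[-1])
            out = torch.ops.aten._weight_int8pack_mm(x2d, self.weight, self.scales)
            return out.reshape(*x.shape[:-1], out.shape[-1])
        # Generic path: dequantize the weight on the fly, then a regular matmul
        return F.linear(x, self.weight.to(dtype=x.dtype)) * self.scales
```

The design point is that the CPU path avoids materializing a dequantized copy of the weight: the fused op reads the int8 weight directly and applies the per-channel scales inside the matmul, which is where the speedup comes from.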

yanbing-j commented 3 months ago

Hi @yanboliang @Chillee , could you please help review this PR? Thanks!

yanbing-j commented 3 months ago

Hi @mikekgfb , could you please help review this PR? Thanks!

malfet commented 2 months ago

We really need to add at least some basic CI before merging these changes, as they can break things...