pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

Optimize Int8 Woq for CPU #161

Open yanbing-j opened 5 months ago

yanbing-j commented 5 months ago

This PR optimizes Int8 WOQ (weight-only quantization) in both gpt-fast and mixtral-moe.

At the current stage, we use torch.ops.aten._weight_int8pack_mm as a workaround. This workaround will be removed once https://github.com/pytorch/pytorch/pull/120985 is merged into a PyTorch stable release. Meanwhile, we update the int8 weight dimensions to match what torch.ops.aten._weight_int8pack_mm expects (introduced in https://github.com/pytorch/pytorch/pull/118056) and add CPU profiling.
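For context, below is a minimal sketch of how an int8 weight-only linear layer might call this op on CPU. The `WeightOnlyInt8Linear` class name, buffer layout, and dtype choices are illustrative assumptions, not the PR's actual code; the op itself takes a 2D float activation, an int8 weight of shape `[out_features, in_features]`, and one scale per output channel.

```python
import torch
import torch.nn as nn

class WeightOnlyInt8Linear(nn.Module):
    """Sketch: linear layer holding an int8 weight with per-channel scales."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        # int8 weight of shape [out_features, in_features], the layout
        # expected by torch.ops.aten._weight_int8pack_mm
        self.register_buffer(
            "weight", torch.zeros(out_features, in_features, dtype=torch.int8)
        )
        # one dequantization scale per output channel, in the activation dtype
        # (bfloat16 here is an assumption for illustration)
        self.register_buffer(
            "scales", torch.ones(out_features, dtype=torch.bfloat16)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: 2D float activation [batch, in_features]; the op fuses the
        # int8 matmul with per-channel dequantization
        return torch.ops.aten._weight_int8pack_mm(x, self.weight, self.scales)
```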

yanbing-j commented 5 months ago

@HDCharles could you please take a look? Thanks!

yanbing-j commented 4 months ago

Hi @yanboliang, could you please take a look? Thanks!