Open casper-hansen opened 6 months ago
@casper-hansen Thanks for bringing this up! We will reach out to the IBM folks and ask whether they are interested in the integration.
Amazing work by @AdnanHoque @ani300 @cyang49 from IBM Research! They will be contributing this back to vLLM soon!
Great work on the MoE kernel Woosuk! Learned a lot studying it :) Yes hoping to contribute back soon!
We've also been working on fusing dequantization into the MoE kernel to allow for a W4A16 GPTQ path; it would be great to collaborate on that front if it's something you've been looking to add!
W4A16 Fused MoE: https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/inference/gptq/mixtral/w4a16_fused_dequant_gemm.py
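For intuition about what the linked W4A16 kernel fuses, here is a plain-NumPy sketch of GPTQ-style int4 dequantization followed by a GEMM. This is an illustration of the arithmetic only, not the Triton kernel: the packing layout (8 nibbles per uint32 along the K axis), the per-group scale/zero handling, and the group size are simplified assumptions.

```python
import numpy as np

def pack_int4(w_int):
    """Pack int4 values in [0, 15] along axis 0, 8 per uint32.
    (Assumed layout for illustration; real GPTQ packing may differ.)"""
    k, n = w_int.shape
    w = w_int.astype(np.uint32).reshape(k // 8, 8, n)
    shifts = (np.arange(8, dtype=np.uint32) * 4)[None, :, None]
    return np.bitwise_or.reduce(w << shifts, axis=1)

def unpack_int4(packed):
    """Inverse of pack_int4: (k//8, n) uint32 -> (k, n) ints in [0, 15]."""
    shifts = (np.arange(8, dtype=np.uint32) * 4)[None, :, None]
    vals = (packed[:, None, :] >> shifts) & 0xF
    return vals.reshape(-1, packed.shape[1]).astype(np.int64)

def dequant_gemm(x, qweight, scales, zeros, group_size):
    """W4A16-style GEMM: dequantize int4 weights, then matmul.
    A fused kernel performs this unpacking inside the GEMM inner loop
    instead of materializing the full floating-point weight matrix."""
    w_int = unpack_int4(qweight)                 # (k, n) integer weights
    g = np.arange(w_int.shape[0]) // group_size  # quant group per K row
    w = (w_int - zeros[g]) * scales[g]           # (k, n) float weights
    return x @ w
```

The point of the fusion is memory traffic: the weights stay 4-bit in global memory and are expanded to fp16 only in registers, tile by tile, inside the matmul loop.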
https://github.com/vllm-project/vllm/pull/3905
There appear to be some correctness issues with these kernels that we need to debug.
🚀 The feature, motivation and pitch
vLLM should adopt the Triton kernels that PyTorch Labs developed recently, since they yield a 2.7x speedup at bs=2 and 4.4x at bs=512. @WoosukKwon Note that the new kernels were developed on top of the existing vLLM kernels, so they should be highly compatible.
Blog: https://pytorch.org/blog/accelerating-moe-model/
Code: https://github.com/pytorch-labs/applied-ai/tree/main/kernels/triton/inference/col_major_moe_gemm
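For readers unfamiliar with what the fused MoE kernel computes, here is a plain-NumPy reference of a top-k MoE forward pass. Like the kernel, it groups tokens by expert so each expert runs one batched GEMM rather than many tiny ones. Shapes and routing details are illustrative assumptions, and the activation is a ReLU stand-in (Mixtral's experts actually use a gated SiLU MLP).

```python
import numpy as np

def moe_reference(x, w1, w2, router_logits, topk):
    """Top-k MoE forward pass (reference, assumed shapes).
    x: (tokens, d); w1: (experts, d, h); w2: (experts, h, d)."""
    # Select top-k experts per token; softmax-renormalize their logits.
    idx = np.argsort(-router_logits, axis=1)[:, :topk]        # (tokens, topk)
    logits = np.take_along_axis(router_logits, idx, axis=1)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    # Group tokens by expert id, mirroring how the fused kernel sorts
    # token blocks so each expert's weights are loaded once per tile.
    for e in range(w1.shape[0]):
        token, slot = np.nonzero(idx == e)
        if token.size == 0:
            continue
        h = np.maximum(x[token] @ w1[e], 0.0)  # ReLU stand-in activation
        out[token] += weights[token, slot][:, None] * (h @ w2[e])
    return out
```

The blog's speedups come from how this grouping and the expert GEMMs are tiled on the GPU (e.g. the column-major access pattern), not from changing the math above.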
Alternatives
No response
Additional context
No response