vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: PyTorch Labs MoE Kernels (2.7-4.4x faster) #3855

Open casper-hansen opened 6 months ago

casper-hansen commented 6 months ago

🚀 The feature, motivation and pitch

vLLM should adopt the recently developed PyTorch Labs Triton MoE kernels, since they yield a 2.7x speedup at bs=2 and 4.4x at bs=512. @WoosukKwon Note that the new kernels were developed on top of the existing vLLM kernels, so they should be highly compatible.

Blog: https://pytorch.org/blog/accelerating-moe-model/?utm_content=288416924
Code: https://github.com/pytorch-labs/applied-ai/tree/main/kernels/triton/inference/col_major_moe_gemm

(Benchmark figure: speedups of the PyTorch Labs MoE kernels over the existing vLLM Triton kernel across batch sizes.)
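For context, here is a minimal, unfused PyTorch sketch of the top-k MoE computation these kernels accelerate; the grouped-GEMM Triton kernels effectively replace the per-expert loop below with a single fused kernel. All names, shapes, and the gating details are illustrative and not taken from either codebase.

```python
import torch
import torch.nn.functional as F


def reference_moe(
    x: torch.Tensor,      # [num_tokens, hidden] input activations
    gate: torch.Tensor,   # [hidden, num_experts] router weights
    w1: torch.Tensor,     # [num_experts, hidden, intermediate] expert up-projections
    w2: torch.Tensor,     # [num_experts, intermediate, hidden] expert down-projections
    top_k: int = 2,
) -> torch.Tensor:
    """Unfused top-k MoE forward pass (illustrative only)."""
    # Router: pick top_k experts per token and renormalize their weights.
    logits = x @ gate                                           # [num_tokens, num_experts]
    weights, experts = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    for e in range(w1.shape[0]):
        # Tokens routed to expert e; `slot` is which of the top_k slots picked it.
        token_idx, slot = torch.where(experts == e)
        if token_idx.numel() == 0:
            continue
        h = F.silu(x[token_idx] @ w1[e])                        # expert FFN (gated proj omitted for brevity)
        out[token_idx] += weights[token_idx, slot, None] * (h @ w2[e])
    return out


# Tiny usage example with made-up shapes: 8 experts, top-2 routing.
x = torch.randn(16, 64)
gate = torch.randn(64, 8)
w1 = torch.randn(8, 64, 128)
w2 = torch.randn(8, 128, 64)
y = reference_moe(x, gate, w1, w2)
```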

Alternatives

No response

Additional context

No response

WoosukKwon commented 6 months ago

@casper-hansen Thanks for bringing this up! We will reach out to the IBM folks and ask whether they are interested in the integration.

njhill commented 6 months ago

Amazing work by @AdnanHoque @ani300 @cyang49 from IBM Research! They will be contributing this back to vLLM soon!

AdnanHoque commented 6 months ago

Great work on the MoE kernel, Woosuk! Learned a lot studying it :) Yes, hoping to contribute back soon!

We've also been working on fusing dequantization into the MoE kernel to allow for a W4A16 GPTQ path; it would be great to collaborate on that front if it's something you've been looking to add!

W4A16 Fused MoE: https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/inference/gptq/mixtral/w4a16_fused_dequant_gemm.py
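As a rough illustration of what the W4A16 path involves, here is an unfused PyTorch sketch that unpacks GPTQ-style int4 weights to floating point and then runs a plain matmul; the fused kernel linked above performs this dequantization inside the Triton GEMM tiles, so the full-precision weight matrix is never materialized in global memory. The packing layout, shapes, and function name here are assumptions for illustration, not the actual kernel interface.

```python
import torch


def dequant_w4a16_matmul(
    x: torch.Tensor,        # [m, k] activations (fp16 in the real path; any float dtype here)
    qweight: torch.Tensor,  # [k // 8, n] int32, eight 4-bit weights packed per element
    scales: torch.Tensor,   # [k // group_size, n] per-group scales
    zeros: torch.Tensor,    # [k // group_size, n] integer per-group zero points
    group_size: int = 128,
) -> torch.Tensor:
    """Dequantize GPTQ-style int4 weights, then run an ordinary matmul (illustrative only)."""
    k = qweight.shape[0] * 8
    # Unpack the eight 4-bit values stored in each int32.
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    w_int = (qweight[:, None, :] >> shifts[None, :, None]) & 0xF   # [k // 8, 8, n]
    w_int = w_int.reshape(k, -1)                                   # [k, n]

    # Apply per-group zero point and scale.
    g = torch.arange(k, device=qweight.device) // group_size       # group index per k-row
    w_deq = (w_int - zeros[g]).to(x.dtype) * scales[g].to(x.dtype) # [k, n] dequantized weights
    return x @ w_deq


# Tiny usage example with made-up shapes.
m, k, n, group_size = 4, 256, 64, 128
x = torch.randn(m, k)
qweight = torch.randint(0, 2**31 - 1, (k // 8, n), dtype=torch.int32)
scales = torch.rand(k // group_size, n)
zeros = torch.full((k // group_size, n), 8, dtype=torch.int32)
y = dequant_w4a16_matmul(x, qweight, scales, zeros, group_size)
```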

robertgshaw2-neuralmagic commented 6 months ago

https://github.com/vllm-project/vllm/pull/3905

There appear to be some correctness issues with these kernels that we need to debug.