vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: A Graph Optimization System in vLLM using torch.compile #6378

Open bnellnm opened 1 month ago

bnellnm commented 1 month ago

Motivation.

At a high level, we at Neural Magic are writing a custom compiler for Torch Dynamo to define a system within vLLM where we can write graph transformations. The main goal is a separation of concerns between high-level model definitions and certain performance-critical low-level decisions. This is especially important for optimizations that are particularly invasive to the model definitions, that break abstractions, that cross boundaries between layers, or that aren't universally valid or useful. If these optimizations are made as part of the model definitions, it becomes much more difficult to add new models.

We are working on an initial set of optimizations using this system, described in detail in the Proposed Passes section.

Although this system operates as a custom compiler inside of Torch Dynamo, it’s best to think of it as an optimization system in vLLM rather than a compiler. Instead of building a vertical compiler stack that lowers high-level tensor operations through successive layers of IR, we are taking the simple, pragmatic approach of improving vLLM’s existing ecosystem of custom kernels rather than replacing it.
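As a rough illustration of the shape this takes, here is a minimal sketch using only public `torch.compile` APIs (the pass and module names are illustrative, not vLLM's actual code): a custom Dynamo backend receives the captured FX graph, runs a list of graph-rewriting passes over it, and then simply executes the rewritten graph, with no further lowering.

```python
import torch
from typing import Callable, List

# Hypothetical pass type: a pass mutates the captured fx.GraphModule in place.
GraphPass = Callable[[torch.fx.GraphModule], None]

def make_optimizer_backend(passes: List[GraphPass]):
    """Build a torch.compile backend that applies each pass to the captured graph."""
    def backend(gm: torch.fx.GraphModule, example_inputs):
        for p in passes:
            p(gm)               # rewrite nodes (e.g., fuse op sequences)
        gm.recompile()          # regenerate Python code from the edited graph
        return gm.forward       # execute the rewritten graph; no lowering
    return backend

def log_graph_pass(gm: torch.fx.GraphModule) -> None:
    # Trivial example pass: just inspect the captured graph.
    print(gm.graph)

class AddRMSNorm(torch.nn.Module):
    def forward(self, x, residual, weight, eps: float = 1e-6):
        x = x + residual
        variance = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(variance + eps) * weight

compiled = torch.compile(AddRMSNorm(), backend=make_optimizer_backend([log_graph_pass]))
out = compiled(torch.randn(4, 8), torch.randn(4, 8), torch.ones(8))
```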

Going forward, based on our experience at Neural Magic with what worked well in DeepSparse, we have a perspective on how graph optimizations should fit into vLLM and how this system should align with the PyTorch team’s plans for torch.compile. In short, we think:

[RFC] A Graph Optimization System in vLLM using torch.compile

Proposed Change.

#6377

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Chillee commented 1 month ago

Sounds exciting! From my POV on the torch.compile team, I think we'd definitely be very interested in supporting custom passes within Inductor better. This has been something we've been particularly interested in recently, and I suspect that it should be possible (without that much work) to make your lives much easier :)

tlrmchlsmth commented 1 month ago

> Sounds exciting! From my POV on the torch.compile team, I think we'd definitely be very interested in supporting custom passes within Inductor better. This has been something we've been particularly interested in recently, and I suspect that it should be possible (without that much work) to make your lives much easier :)

That is great to hear. Ultimately we want this to integrate with Inductor as natively as possible. We'd appreciate whatever help we can get there, and better support for custom passes would be phenomenal.
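For reference, one way to hook a custom pass into Inductor today looks roughly like the sketch below. It assumes a PyTorch version where `torch._inductor.config` exposes `post_grad_custom_post_pass` and accepts a plain callable taking an `fx.Graph`; newer releases may instead expect a `CustomGraphPass` object (with a `uuid`) so the pass cooperates with Inductor's caching. The pass body here is only a placeholder, not vLLM code.

```python
import torch
import torch._inductor.config as inductor_config

def count_nodes_pass(graph: torch.fx.Graph) -> None:
    # Runs after Inductor's own post-grad passes; a real pass would rewrite
    # nodes here, e.g. swap a matched op sequence for a custom kernel call.
    print(f"post-grad graph has {len(list(graph.nodes))} nodes")

# Register the hook (version-dependent; see the note above).
inductor_config.post_grad_custom_post_pass = count_nodes_pass

@torch.compile  # the default "inductor" backend picks up the config above
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
```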

gx16377 commented 1 month ago

Hi, does this mean (or something along those lines) that we won't need to write fused_add_rms_norm in the model definition (just write rms_norm and the residual add), and the graph optimization mechanism will recognize the pattern and call the fused kernel (rather than compiling it directly down to assembly)?

bnellnm commented 3 weeks ago

> Hi, does this mean (or something along those lines) that we won't need to write fused_add_rms_norm in the model definition (just write rms_norm and the residual add), and the graph optimization mechanism will recognize the pattern and call the fused kernel (rather than compiling it directly down to assembly)?

@gx16377, we don't do this currently, but it is something we are thinking about for future iterations of the optimizer.
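For illustration, here is a hedged sketch of the kind of rewrite being discussed, using only public `torch.fx` APIs rather than anything vLLM-specific: a `subgraph_rewriter` pass that matches the "residual add followed by rms_norm" pattern and swaps in a single fused call. The `fused_add_rms_norm` below is a pure-PyTorch stand-in for a fused kernel (vLLM ships one as a custom CUDA op); an actual pass would dispatch to that kernel instead.

```python
import torch
import torch.fx
from torch.fx import symbolic_trace, subgraph_rewriter

def rms_norm(x, weight, eps: float = 1e-6):
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def fused_add_rms_norm(x, residual, weight, eps: float = 1e-6):
    # Pure-PyTorch stand-in; a real pass would call the fused CUDA kernel here.
    return rms_norm(x + residual, weight, eps)

# Treat the fused op as a leaf so tracing emits a single call_function node
# instead of tracing into its implementation.
torch.fx.wrap("fused_add_rms_norm")

def pattern(x, residual, weight):
    # What the model author writes: a plain residual add followed by rms_norm.
    return rms_norm(x + residual, weight)

def replacement(x, residual, weight):
    # What the pass substitutes: one fused call.
    return fused_add_rms_norm(x, residual, weight)

class Block(torch.nn.Module):
    def forward(self, x, residual, weight):
        return rms_norm(x + residual, weight)

gm = symbolic_trace(Block())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
gm.recompile()
print(gm.code)  # the add + rms_norm sequence is now one fused_add_rms_norm call
out = gm(torch.randn(4, 8), torch.randn(4, 8), torch.ones(8))
```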