vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: A Graph Optimization System in vLLM using torch.compile #6378

Open bnellnm opened 1 month ago

bnellnm commented 1 month ago

Motivation.

At a high level, we at Neural Magic are writing a custom compiler for Torch Dynamo to define a system within vLLM where we can write graph transformations. The main goal is a separation of concerns between high-level model definitions and certain performance-critical low-level decisions. This is especially important for optimizations that are particularly invasive to the model definitions, that break abstractions, that cross boundaries between layers, or that aren't universally valid or useful. If these optimizations are made as part of the model definitions, it becomes much more difficult to add new models.

We are working on an initial set of optimizations using this system, described in detail in the Proposed Passes section.

Although this system operates as a custom compiler inside of Torch Dynamo, it’s best to think of it as an optimization system in vLLM rather than a compiler. Instead of building a vertical compiler stack that lowers high-level tensor operations through successive layers of IR, we are taking the simple, pragmatic approach of improving vLLM’s existing ecosystem of custom kernels rather than replacing it.
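As a rough illustration of the shape this takes, here is a minimal sketch using only public `torch.compile` APIs (the pass and module names are illustrative, not vLLM's actual code): a custom Dynamo backend receives the captured FX graph, runs a list of graph-rewriting passes over it, and then simply executes the rewritten graph, with no further lowering.

```python
import torch
from typing import Callable, List

# Hypothetical pass type: a pass mutates the captured fx.GraphModule in place.
GraphPass = Callable[[torch.fx.GraphModule], None]

def make_optimizer_backend(passes: List[GraphPass]):
    """Build a torch.compile backend that applies each pass to the captured graph."""
    def backend(gm: torch.fx.GraphModule, example_inputs):
        for p in passes:
            p(gm)               # rewrite nodes (e.g., fuse op sequences)
        gm.recompile()          # regenerate Python code from the edited graph
        return gm.forward       # execute the rewritten graph; no lowering
    return backend

def log_graph_pass(gm: torch.fx.GraphModule) -> None:
    # Trivial example pass: just inspect the captured graph.
    print(gm.graph)

class AddRMSNorm(torch.nn.Module):
    def forward(self, x, residual, weight, eps: float = 1e-6):
        x = x + residual
        variance = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(variance + eps) * weight

compiled = torch.compile(AddRMSNorm(), backend=make_optimizer_backend([log_graph_pass]))
out = compiled(torch.randn(4, 8), torch.randn(4, 8), torch.ones(8))
```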

Going forward, based on our experience at Neural Magic with what worked well in DeepSparse, we have a perspective on how graph optimizations should fit into vLLM and how this system should align with the PyTorch team’s plans for torch.compile. In short, we think:

[RFC] A Graph Optimization System in vLLM using torch.compile

Proposed Change.

#6377

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Chillee commented 1 month ago

Sounds exciting! From my POV on the torch.compile team, I think we'd definitely be very interested in supporting custom passes within Inductor better. This has been something we've been particularly interested in recently, and I suspect that it should be possible (without that much work) to make your lives much easier :)

tlrmchlsmth commented 1 month ago

> Sounds exciting! From my POV on the torch.compile team, I think we'd definitely be very interested in supporting custom passes within Inductor better. This has been something we've been particularly interested in recently, and I suspect that it should be possible (without that much work) to make your lives much easier :)

That is great to hear. Ultimately we want this to integrate with Inductor as natively as possible. We'd appreciate whatever help we can get there, and better support for custom passes would be phenomenal.
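For reference, one way to hook a custom pass into Inductor today looks roughly like the sketch below. It assumes a PyTorch version where `torch._inductor.config` exposes `post_grad_custom_post_pass` and accepts a plain callable taking an `fx.Graph`; newer releases may instead expect a `CustomGraphPass` object (with a `uuid`) so the pass cooperates with Inductor's caching. The pass body here is only a placeholder, not vLLM code.

```python
import torch
import torch._inductor.config as inductor_config

def count_nodes_pass(graph: torch.fx.Graph) -> None:
    # Runs after Inductor's own post-grad passes; a real pass would rewrite
    # nodes here, e.g. swap a matched op sequence for a custom kernel call.
    print(f"post-grad graph has {len(list(graph.nodes))} nodes")

# Register the hook (version-dependent; see the note above).
inductor_config.post_grad_custom_post_pass = count_nodes_pass

@torch.compile  # the default "inductor" backend picks up the config above
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
```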

gx16377 commented 1 month ago

Hi, does this mean (or something along those lines) that we won't need to write fused_add_rms_norm in the model definition (just write rms_norm and the residual add), and the graph optimization mechanism will recognize the pattern and call the fused kernel (rather than compiling it directly down to assembly)?

bnellnm commented 3 weeks ago

> Hi, does this mean (or something along those lines) that we won't need to write fused_add_rms_norm in the model definition (just write rms_norm and the residual add), and the graph optimization mechanism will recognize the pattern and call the fused kernel (rather than compiling it directly down to assembly)?

@gx16377, we don't do this currently, but it is something we are thinking about for future iterations of the optimizer.
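For illustration, here is a hedged sketch of the kind of rewrite being discussed, using only public `torch.fx` APIs rather than anything vLLM-specific: a `subgraph_rewriter` pass that matches the "residual add followed by rms_norm" pattern and swaps in a single fused call. The `fused_add_rms_norm` below is a pure-PyTorch stand-in for a fused kernel (vLLM ships one as a custom CUDA op); an actual pass would dispatch to that kernel instead.

```python
import torch
import torch.fx
from torch.fx import symbolic_trace, subgraph_rewriter

def rms_norm(x, weight, eps: float = 1e-6):
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def fused_add_rms_norm(x, residual, weight, eps: float = 1e-6):
    # Pure-PyTorch stand-in; a real pass would call the fused CUDA kernel here.
    return rms_norm(x + residual, weight, eps)

# Treat the fused op as a leaf so tracing emits a single call_function node
# instead of tracing into its implementation.
torch.fx.wrap("fused_add_rms_norm")

def pattern(x, residual, weight):
    # What the model author writes: a plain residual add followed by rms_norm.
    return rms_norm(x + residual, weight)

def replacement(x, residual, weight):
    # What the pass substitutes: one fused call.
    return fused_add_rms_norm(x, residual, weight)

class Block(torch.nn.Module):
    def forward(self, x, residual, weight):
        return rms_norm(x + residual, weight)

gm = symbolic_trace(Block())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
gm.recompile()
print(gm.code)  # the add + rms_norm sequence is now one fused_add_rms_norm call
out = gm(torch.randn(4, 8), torch.randn(4, 8), torch.ones(8))
```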