vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: QuantizationConfig and QuantizeMethodBase Refactor for Simplifying Kernel Integrations #8913

Status: Open · opened by LucasWilkinson 1 month ago

LucasWilkinson commented 1 month ago

Motivation.

Currently vLLM generally has a tight coupling between the checkpoint format and the kernel used during model execution. This coupling causes issues as the diversity of hardware and kernels increases, and is particularly challenging for quantized kernels (mixed-precision kernels with sub-byte weights in particular). For performance, quantized layers frequently want to dispatch to hardware-specialized kernels, and mixed-input kernels commonly pre-pack the weights into a bespoke layout that closely matches the hardware they run on.

The goal is to separate the kernel implementation from the checkpoint format. This will require a more sophisticated way of describing the linear-layer operation, in addition to a more sophisticated way of describing packed layouts within vLLM. The result will hopefully make it easier to register a kernel as a backend for multiple checkpoint formats. It will also require standardizing the calling structure of quantized linear layers in vLLM.

Proposed Change.

The high-level proposal is to move the create_weights logic out of QuantizeMethodBase and into QuantizationConfig, since QuantizationConfig is more closely tied to the serialization format, and then to create a CompressedLinearDescriptor that lets the QuantizationConfig describe the computation that needs to take place, so that a kernel dispatcher can select the most appropriate kernel (one that can_implement the computation).
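To make the shape of the proposal concrete, here is a rough Python sketch. The names QuantizationConfig, create_weights, CompressedLinearDescriptor, and can_implement come from this RFC; the descriptor fields, method signatures, and the select_kernel dispatcher are illustrative assumptions only, not the final design.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class CompressedLinearDescriptor:
    """Describes the linear computation independently of any kernel.

    Fields here are illustrative assumptions, not the final schema."""
    weight_dtype: torch.dtype        # e.g. a sub-byte quantized type
    activation_dtype: torch.dtype    # e.g. torch.float16
    group_size: Optional[int]        # quantization group size, None = per-channel
    has_zero_points: bool


class QuantizationConfig:
    """Tied to the checkpoint/serialization format."""

    def create_weights(self, layer: torch.nn.Module, **extra) -> None:
        """Create parameters matching the checkpoint layout (moved here
        from QuantizeMethodBase under this proposal)."""
        raise NotImplementedError

    def get_linear_descriptor(self) -> CompressedLinearDescriptor:
        """Describe the computation so a dispatcher can pick a kernel."""
        raise NotImplementedError


class QuantKernel:
    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        """Return True if this kernel supports the described computation."""
        raise NotImplementedError


def select_kernel(desc: CompressedLinearDescriptor,
                  kernels: list[type[QuantKernel]]) -> type[QuantKernel]:
    # Dispatcher sketch: first (highest-priority) kernel that can
    # implement the described computation wins.
    for kernel in kernels:
        if kernel.can_implement(desc):
            return kernel
    raise ValueError("no kernel can implement the described linear op")
```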

More details: https://docs.google.com/document/d/1AfgGfF73H_hcXfw6ehYO_l1vHEItsopbxFoV1PvnGIQ/edit?usp=sharing

Feedback Period.

Until Oct 7. Preparatory work to help demonstrate the approach will begin before then.

CC List.

@dsikka @mgoin @robertgshaw2-neuralmagic @comaniac @alexm-neuralmagic @HanGuo97 @tlrmchlsmth @bnellnm

Any Other Things.

No response


HanGuo97 commented 1 month ago

Thanks for the RFC/doc! A couple of questions (quoted in the reply below).

LucasWilkinson commented 1 month ago

@HanGuo97 thanks for reading it through

> I'm somewhat unclear about the separation of kernel implementation vs. checkpoint format. Does this mean the kernel will have to work with a "universal" data format? Or is this separation simply a way to implement the dispatch logic? (Based on what I understood, different kernels usually have somewhat different packing formats.)

Yes, the goal here would be to make "repacking" of weights more of a first-class citizen. The process_weights_after_loading step (I'm proposing this gets renamed to prep_weights_for_execution) would be where kernels can repack the weights into whatever layout they want. Since the source layout would be the checkpoint layout, not all kernels will be able to interpret it in order to repack it, so in this RFC I propose that layouts have a to_standard_layout function which can force the layout into a known "standard" layout (if possible). This way, the minimal repacking implementation a kernel has to write is from the "standard" layout to its desired layout. If the kernel recognizes the layout, it doesn't need to call to_standard_layout and can just use the weights as-is; i.e., if the Marlin kernel receives something already in the Marlin layout, it can skip going to the standard layout and back to the Marlin layout.
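To illustrate that flow, a minimal sketch; the WeightLayout/MarlinLayout class names and the _repack_from_standard helper below are hypothetical, and only to_standard_layout and prep_weights_for_execution come from the proposal:

```python
import torch


class WeightLayout:
    """Base class describing how packed weights are laid out."""

    def to_standard_layout(self, packed: torch.Tensor) -> torch.Tensor:
        """Force the weights into a known 'standard' layout (if possible)."""
        raise NotImplementedError


class MarlinLayout(WeightLayout):
    def to_standard_layout(self, packed: torch.Tensor) -> torch.Tensor:
        # A real implementation would undo the Marlin interleaving here.
        return packed


class MarlinKernel:
    def prep_weights_for_execution(self, packed: torch.Tensor,
                                   layout: WeightLayout) -> torch.Tensor:
        if isinstance(layout, MarlinLayout):
            # The checkpoint already matches this kernel's layout:
            # skip the round-trip through the standard layout.
            return packed
        # Minimal path every kernel must support:
        # checkpoint layout -> standard layout -> kernel's own layout.
        standard = layout.to_standard_layout(packed)
        return self._repack_from_standard(standard)

    def _repack_from_standard(self, standard: torch.Tensor) -> torch.Tensor:
        # Placeholder: a real kernel would apply its bespoke interleaving.
        return standard.contiguous()
```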

> Does this RFC still allow arbitrary packed data formats?

Yes, we will maintain the "legacy" pathway (a 1:1 mapping between QuantizationConfig and QuantLinearMethod), i.e. allow someone to bypass the CompressedLinearDescriptor dispatching and instead return a pre-constructed QuantLinearMethod directly from QuantizationConfig.create_quant_method (a rename of QuantizationConfig.get_quant_method). This can be used for more bespoke packing formats like GGUF or more "researchy" code like AQLM.
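Roughly, the legacy pathway would look something like this; GGUFConfig and GGUFLinearMethod below are illustrative stand-ins, not the actual vLLM classes:

```python
class QuantizationConfig:
    def create_quant_method(self, layer):
        """Rename of get_quant_method; may bypass descriptor dispatch."""
        raise NotImplementedError


class GGUFLinearMethod:
    """Stand-in for a bespoke linear method bound to the GGUF format."""

    def __init__(self, config: "GGUFConfig"):
        self.config = config


class GGUFConfig(QuantizationConfig):
    def create_quant_method(self, layer):
        # Legacy 1:1 pathway: no CompressedLinearDescriptor, no kernel
        # dispatch -- just return the format's dedicated method directly.
        return GGUFLinearMethod(self)
```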

> Somewhat minor: the pack_factor is assumed to be an integer in the documentation, but this is kind of not true for, say, 3-bit.

Good catch! I had updated it to Fraction in some locations, and have now done a scrub to make sure it is a Fraction everywhere :+1:
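For example, assuming pack_factor means the number of weights packed per 32-bit storage word:

```python
from fractions import Fraction

# 4-bit weights: 8 fit in one 32-bit word, so an int would suffice...
pack_factor_4bit = Fraction(32, 4)            # == 8

# ...but 3-bit weights do not divide 32 evenly, so a Fraction is needed.
pack_factor_3bit = Fraction(32, 3)

# Derived shapes stay exact with Fraction arithmetic:
input_size = 3072
packed_size = input_size // pack_factor_3bit  # 3072 * 3 // 32 = 288
assert packed_size == 288
```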

> How would this take care of methods that quantize the model after loading the checkpoint? (IIRC, BNB's usage in vLLM does this.)

Good point, I'm not sure I see BNB doing this (if you don't mind pointing me to it, I'd appreciate it). But it looks like DeepSpeedFPLinearMethod does: I think this logic would stay relatively the same, i.e. use an overloaded "weight_loader" (here), except that under the RFC, since QuantizationConfig is responsible for create_weights, it would be QuantizationConfig registering the weight_loader rather than QuantLinearMethod. I do think we could come up with something more flexible for this too, although I think that's somewhat orthogonal to this RFC. Very open to suggestions though.
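Something like the following sketch, where FPQuantConfig and its toy quantization scheme are purely hypothetical; the point is just that the config, now owning create_weights, attaches the overloaded weight_loader:

```python
import torch


class FPQuantConfig:
    """Sketch of a config whose format quantizes at load time."""

    def create_weights(self, layer: torch.nn.Module,
                       out_features: int, in_features: int) -> None:
        qweight = torch.nn.Parameter(
            torch.empty(out_features, in_features, dtype=torch.int8),
            requires_grad=False)
        # Under the RFC the config (not the linear method) owns
        # create_weights, so it also attaches the overloaded loader.
        qweight.weight_loader = self._quantizing_weight_loader
        layer.register_parameter("qweight", qweight)

    def _quantizing_weight_loader(self, param: torch.nn.Parameter,
                                  loaded_weight: torch.Tensor) -> None:
        # Quantize the full-precision checkpoint tensor as it is loaded.
        param.data.copy_(self._quantize(loaded_weight))

    @staticmethod
    def _quantize(w: torch.Tensor) -> torch.Tensor:
        # Toy symmetric per-tensor int8 quantization, purely illustrative.
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        return (w / scale).round().clamp(-128, 127).to(torch.int8)
```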