LucasWilkinson opened 1 month ago
> Thanks for the RFC/doc! A couple questions:
>
> `pack_factor` is assumed to be an integer in the documentation, but this is kind of not true for, say, 3-bit.

@HanGuo97 thanks for reading it through!
> I'm somewhat unclear about the separation of kernel implementation vs. checkpoint format. Does this mean the kernel will have to work with a "universal" data format? Or is this separation simply a way to implement the dispatch logic? (Based on what I understood, different kernels usually have somewhat different packing formats.)
Yes, the goal here would be to make "repacking" of weights more of a first-class citizen. The `process_weights_after_loading` step (proposing that this gets renamed to `prep_weights_for_execution`) would be where the kernels can repack the weights into whatever layout they want. Since the source layout would be the checkpoint layout, not all kernels will be able to interpret the layout in order to repack it, so in this RFC I propose that layouts have a `to_standard_layout` function which can force the layout into a known "standard" layout (if possible). This way, the minimal repacking implementation a kernel would have to write is from the "standard" layout to its desired layout. If the kernel recognizes the layout, it doesn't need to call `to_standard_layout` and can just use the layout as is; i.e., if the Marlin kernel receives something already in the Marlin layout, it can skip going to the standard layout and back to the Marlin layout.
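To make the fallback concrete, here is a minimal sketch of that control flow. The function names echo the RFC (`to_standard_layout`, `prep_weights_for_execution`), but the layout names and the transpose-based "packing" are purely illustrative stand-ins:

```python
# Illustrative sketch only: the control flow mirrors the RFC, while the
# "colmajor" layout and transpose-based packing are hypothetical stand-ins.

def to_standard_layout(w, layout):
    # Hypothetical: "colmajor" stores the matrix transposed.
    return [list(r) for r in zip(*w)] if layout == "colmajor" else w

def from_standard_layout(w, layout):
    return [list(r) for r in zip(*w)] if layout == "colmajor" else w

def prep_weights_for_execution(w, source_layout, kernel_layout):
    # Fast path: the kernel already recognizes the checkpoint layout
    # (e.g. Marlin kernel handed weights already in the Marlin layout).
    if source_layout == kernel_layout:
        return w
    # Otherwise force to the known "standard" layout, then repack into
    # the kernel's desired layout.
    standard = to_standard_layout(w, source_layout)
    return from_standard_layout(standard, kernel_layout)

w = [[1, 2, 3], [4, 5, 6]]  # "standard" (row-major) layout
assert prep_weights_for_execution(w, "standard", "standard") == w
assert prep_weights_for_execution(w, "standard", "colmajor") == [[1, 4], [2, 5], [3, 6]]
```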
> Does this RFC still allow arbitrary packed data formats?
Yes, we will maintain the "legacy" pathway (a 1:1 mapping between `QuantizationConfig` and `QuantLinearMethod`), i.e. allow someone to not use the `CompressedLinearDescriptor` dispatching but instead just return a pre-constructed `QuantLinearMethod` directly from `QuantizationConfig.create_quant_method` (a rename of `QuantizationConfig.get_quant_method`). This can be used for more bespoke packing formats like GGUF or more "researchy" code like AQLM.
> Somewhat minor. The `pack_factor` is assumed to be an integer in the documentation, but this is kind of not true for, say, 3-bit.
Good catch! I had updated it to `Fraction` in some locations, but I did a scrub to make sure it is now updated to `Fraction` everywhere :+1:
> How would this take care of methods that quantize the model after loading the checkpoint? (IIRC, BNB's usage in vLLM does this.)
Good point. I'm not sure I see BNB doing this (if you don't mind pointing me to it, I'd appreciate it), but it looks like `DeepSpeedFPLinearMethod` does. I think this logic would stay relatively the same, i.e. use an overloaded "weight_loader" (here); the difference under the RFC is that since `QuantizationConfig` is responsible for `create_weights`, it would be registering the `weight_loader`, not `QuantLinearMethod`. I do think we could come up with something more flexible for this too, although I think that's somewhat orthogonal to this RFC. Very open to suggestions though.
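To sketch that registration shift: under the RFC, the config (rather than the linear method) attaches the loader that quantizes at load time. Everything below except the class and method names taken from the thread is hypothetical, including the toy `Param` container and the trivial int8-style rounding standing in for real quantization:

```python
# Hypothetical sketch: QuantizationConfig.create_weights registers an
# overloaded weight_loader that quantizes at load time (the pattern
# DeepSpeedFPLinearMethod follows today). All names besides those from
# the thread are illustrative.

class Param:
    def __init__(self):
        self.data = None
        self.weight_loader = None

class MyQuantizationConfig:
    def create_weights(self, layer):
        param = Param()
        # The config, not the QuantLinearMethod, registers the loader.
        param.weight_loader = self._quantizing_loader
        layer["weight"] = param
        return layer

    @staticmethod
    def _quantizing_loader(param, loaded_weight):
        # Toy stand-in for real quantization: map to the int8 range
        # and keep the scale alongside the quantized values.
        amax = max(abs(x) for x in loaded_weight)
        param.data = [round(x * 127 / amax) for x in loaded_weight]
        param.scale = amax / 127

layer = {}
MyQuantizationConfig().create_weights(layer)
layer["weight"].weight_loader(layer["weight"], [0.5, -1.0, 0.25])
print(layer["weight"].data)  # [64, -127, 32]
```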
### Motivation.
Currently, vLLM generally has a tight coupling between the checkpoint format and the kernel used during model execution. This model causes issues as the diversity of hardware and kernels increases, and it is particularly challenging for quantized kernels (mixed-precision kernels with sub-byte weights in particular). For performance, quantized operations frequently want to run hardware-specialized kernels, and for mixed-input they commonly pre-pack the weights into a bespoke layout that closely matches the hardware they run on.

The goal is to separate the kernel implementation from the checkpoint format; this will require a more sophisticated way of describing the linear-layer operation, in addition to a more sophisticated way of describing packed layouts within vLLM. The result will hopefully make it easier to register a kernel as a backend for multiple checkpoint formats. It will also require standardizing the calling structure of quantized linear layers in vLLM.
### Proposed Change.
The high-level proposal is to separate out the `create_weights` logic, moving it into `QuantizationConfig` from `QuantizeMethodBase`, as `QuantizationConfig` is more closely tied to the serialization format. Then, create a `CompressedLinearDescriptor` to allow the `QuantizationConfig` to describe the computation that needs to take place, allowing a kernel dispatcher to select the most appropriate kernel (one that `can_implement` the computation).

More details: https://docs.google.com/document/d/1AfgGfF73H_hcXfw6ehYO_l1vHEItsopbxFoV1PvnGIQ/edit?usp=sharing
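The dispatch idea above can be sketched as follows. Only `CompressedLinearDescriptor` and `can_implement` come from the RFC; the descriptor fields, backend classes, and `select_kernel` helper are hypothetical illustrations, not the proposed vLLM API:

```python
# Hypothetical sketch of descriptor-based kernel dispatch. Names other
# than CompressedLinearDescriptor / can_implement are illustrative.
from dataclasses import dataclass

@dataclass
class CompressedLinearDescriptor:
    weight_bits: int
    group_size: int
    layout: str  # checkpoint layout name, e.g. "standard" or "marlin"

class KernelBackend:
    name = "base"

    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        raise NotImplementedError

class MarlinBackend(KernelBackend):
    name = "marlin"

    @classmethod
    def can_implement(cls, desc):
        # Toy constraint for the sketch: this backend only handles
        # 4-bit weights; unrecognized layouts would be repacked via
        # the "standard" layout in prep_weights_for_execution.
        return desc.weight_bits == 4

def select_kernel(desc, backends):
    # Dispatcher: first backend that can_implement the described
    # computation wins.
    for backend in backends:
        if backend.can_implement(desc):
            return backend
    raise ValueError("no registered kernel can implement this descriptor")

desc = CompressedLinearDescriptor(weight_bits=4, group_size=128, layout="standard")
print(select_kernel(desc, [MarlinBackend]).name)  # marlin
```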
### Feedback Period.

Until Oct 7. I will begin preparatory work to help demonstrate the design before then.
### CC List.
@dsikka @mgoin @robertgshaw2-neuralmagic @comaniac @alexm-neuralmagic @HanGuo97 @tlrmchlsmth @bnellnm
### Any Other Things.

No response