pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning

Support NF4 quantization of linear layers without LoRA applied #1093

Open ebsmothers opened 2 weeks ago

ebsmothers commented 2 weeks ago

As pointed out by @janeyx99, our quantize_base argument will only quantize the base model weights of linear layers that have LoRA applied to them (see e.g. here in our Llama3 self-attention builder). But this is a somewhat artificial constraint: there's no reason we can't use the same to_nf4 API we already use in LoRA to quantize other nn.Linears and save memory. We could also define e.g. an NF4Linear class if we want (in fact we previously had such a class but ultimately didn't use it, see #465). We just need to figure out the right way to expose this. We could either
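For illustration, a minimal sketch of what such a frozen NF4 linear could look like (FrozenNF4Linear is a hypothetical name here; this leans on torchao's to_nf4 / linear_nf4 helpers and isn't torchtune's actual implementation):

```python
import torch
import torch.nn as nn
from torchao.dtypes.nf4tensor import linear_nf4, to_nf4


class FrozenNF4Linear(nn.Module):
    """Drop-in replacement for a frozen nn.Linear whose weight is stored in NF4."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Quantize the original weight to NF4 and keep it frozen.
        self.weight = nn.Parameter(to_nf4(linear.weight.detach()), requires_grad=False)
        self.bias = (
            nn.Parameter(linear.bias.detach(), requires_grad=False)
            if linear.bias is not None
            else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # linear_nf4 dequantizes the NF4 weight on the fly for the matmul.
        out = linear_nf4(input=x, weight=self.weight)
        if self.bias is not None:
            out = out + self.bias
        return out
```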

cc @joecummings, @winglian

rohan-varma commented 2 weeks ago

Thought it might be useful to share a bit of context on the original reasoning here: we initially made quantize_base apply only to LoRA modules because we wanted to focus on replicating the original QLoRA implementation from the paper (https://arxiv.org/abs/2305.14314), and planned to generalize it as a follow-up if there was popular user demand.

Agreed there are no technical blockers to generalizing this, though I'll point out that users probably can't use the to_nf4 API directly out of the box to achieve this today: there are likely some things we do around the state_dict that would also need to change.
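To make the state_dict point concrete, this is roughly the kind of handling I mean: dequantize NF4 weights back to a regular dtype when the state_dict is produced, so checkpoints stay loadable by code that knows nothing about NF4. The hook name is made up and this assumes NF4Tensor.to(dtype) dequantizes; it's a sketch, not what torchtune ships today:

```python
import torch
from torchao.dtypes.nf4tensor import NF4Tensor


def dequantize_nf4_state_dict_hook(module, state_dict, prefix, local_metadata):
    """Convert NF4 weights back to bf16 in the produced state_dict."""
    for key, value in state_dict.items():
        if isinstance(value, NF4Tensor):
            # Assumes NF4Tensor.to(dtype) dequantizes back to a regular tensor.
            state_dict[key] = value.to(torch.bfloat16)


# Registration would look roughly like:
# model._register_state_dict_hook(dequantize_nf4_state_dict_hook)
```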

What about configuring it the same way we do for LoRA modules? I.e., we specify lora_modules = [k_proj, v_proj, output_proj] and could also specify quantized_modules = [...] in a similar way.
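Roughly, a builder call could then look like the following. Note that quantized_modules is not an existing argument, just the hypothetical shape of the proposal, and the module names are illustrative:

```python
from torchtune.models.llama3 import lora_llama3_8b

# quantized_modules is hypothetical; the other arguments mirror the existing builder.
model = lora_llama3_8b(
    lora_attn_modules=["k_proj", "v_proj", "output_proj"],
    lora_rank=8,
    lora_alpha=16,
    quantize_base=True,
    quantized_modules=["q_proj", "k_proj", "v_proj", "output_proj", "w1", "w2", "w3"],
)
```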

janeyx99 commented 1 week ago

By the way, there is also the apply_lora_to_output config that might be relevant; I'm not sure how it interacts with lora_attn_modules when output_proj is specified. Along the same lines, what's the current mechanism for specifying QLoRA over LoRA? Will that conflict semantically with the newly proposed quantized_modules?

Maybe apply_lora_to_output should be merged into (or controlled by) lora_modules, and quantized_modules could be consolidated with whatever logic switches LoRA to QLoRA today.

ebsmothers commented 1 week ago

@janeyx99 sorry, these are confusingly named. apply_lora_to_output refers to the final projection back to the vocab dimension (i.e. the LM head), while output_proj inside lora_attn_modules is the output projection within each layer's self-attention. So they actually refer to two different things. We've had some discussions about consolidating all of these args into a single lora_modules (similar to how it's exposed in PEFT), which might be clearer, but we haven't prioritized it yet.
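To make the contrast concrete, roughly how the two knobs are used today (argument values here are illustrative):

```python
from torchtune.models.llama3 import lora_llama3_8b

model = lora_llama3_8b(
    # "output_proj" here is each layer's self-attention output projection
    lora_attn_modules=["q_proj", "v_proj", "output_proj"],
    # ...whereas this flag controls LoRA on the final projection to the vocab dim (the LM head)
    apply_lora_to_output=True,
    lora_rank=8,
    lora_alpha=16,
)
```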

winglian commented 1 week ago

@ebsmothers I would prefer something along the lines of:

quantized_modules: true

to mean quantize everything and then use

quantized_modules = [...]

if a user wants more control. This adds some validation complexity, since I'd expect we'd want to ensure that quantized_modules is a superset of lora_modules.
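For concreteness, a minimal sketch of the Union[bool, List[str]] handling and that superset check (function name and error message are just placeholders):

```python
from typing import List, Union


def resolve_quantized_modules(
    quantized_modules: Union[bool, List[str]],
    lora_modules: List[str],
    all_linear_modules: List[str],
) -> List[str]:
    if isinstance(quantized_modules, bool):
        # True -> quantize every linear layer's base weight;
        # False -> current behavior: only the base weights of LoRA-applied linears.
        return all_linear_modules if quantized_modules else lora_modules
    # List case: enforce the superset constraint mentioned above.
    missing = set(lora_modules) - set(quantized_modules)
    if missing:
        raise ValueError(
            f"quantized_modules must be a superset of lora_modules; missing: {sorted(missing)}"
        )
    return list(quantized_modules)
```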

ebsmothers commented 1 week ago

@winglian does this mean quantized_modules would be something like Union[bool, List[str]]? At least in torchtune we haven't received a ton of requests for really granular control over quantization logic. Based on that, my assumption is that the bool version of quantized_modules you gave would suffice for a first version. However, I'd also be interested in your perspective from the axolotl community: have you observed that users want the additional level of configurability of specifying a full list of modules to quantize?

winglian commented 1 week ago

Generally, users haven't asked for that level of granularity. Additionally, most users don't understand how to track down the module names, especially since they may differ from one model architecture to another.

So just to summarize our collective thinking: we want to modify the default behavior of quantize_base so that it quantizes all linear layers, and not provide any additional granularity at this point in time.
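As a sketch of what that default could look like, reusing the hypothetical FrozenNF4Linear wrapper from earlier in the thread (LoRA-wrapped linears would be left alone, since they already quantize their base weight when quantize_base=True):

```python
import torch.nn as nn


def quantize_all_base_linears(model: nn.Module) -> nn.Module:
    """Recursively swap every plain nn.Linear for a frozen NF4-backed linear."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            # FrozenNF4Linear is the hypothetical wrapper sketched earlier in this thread.
            setattr(model, name, FrozenNF4Linear(child))
        else:
            quantize_all_base_linears(child)
    return model
```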