Open ebsmothers opened 2 weeks ago
Thought it might be useful to share a bit of context on the original basis for this: we initially enabled `quantize_base` only for LoRA modules because we wanted to focus on replicating the original QLoRA implementation from the paper (https://arxiv.org/abs/2305.14314), and planned to follow up with a generalization if there was user demand.
Agreed there are no technical blockers to generalizing this, though I'll point out that users probably can't use the `to_nf4` API out of the box to achieve this today, since there are some things we do around the state dict that would also need to change.
What about configuring it the same way we do for LoRA modules? So we specify `lora_modules = [k_proj, v_proj, output_proj]` and can also specify `quantized_modules = [...]` in a similar way.
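A minimal sketch of what this proposal could look like, assuming two independent lists where `quantized_modules` may cover more layers than `lora_modules`. The key names and module names are illustrative, not actual torchtune config options:

```python
# Hypothetical config values: which projections get LoRA adapters, and
# which base weights get NF4-quantized. Names are assumptions.
lora_modules = ["k_proj", "v_proj", "output_proj"]
quantized_modules = ["k_proj", "v_proj", "output_proj", "w1", "w2"]

def plan_for(name):
    """Return (apply_lora, quantize_base_weight) for a given module name."""
    return (name in lora_modules, name in quantized_modules)

print(plan_for("k_proj"))   # LoRA + quantized base: (True, True)
print(plan_for("w1"))       # quantized base only:   (False, True)
print(plan_for("lm_head"))  # untouched:             (False, False)
```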
By the way, there is also the `apply_lora_to_output` config that might be relevant; I'm not sure how it interacts with `lora_attn_modules` when `output_proj` is specified. Along the same lines, what's the current mechanism for specifying QLoRA over LoRA, and would it conflict semantically with the newly proposed `quantized_modules`?
Maybe `apply_lora_to_output` should get merged into / controlled by just `lora_modules`, and `quantized_modules` could be consolidated with whatever logic switches LoRA to QLoRA today.
@janeyx99 sorry, these are confusingly named. `apply_lora_to_output` refers to the final projection back to `vocab_dim` (i.e. the LM head), while `output_proj` inside `lora_attn_modules` is the output projection within each layer's self-attention. So they are actually referring to two different things. We've had some discussions around consolidating all of these args into a single `lora_modules` (similar to how it's exposed in PEFT), which may be clearer, but we haven't prioritized it yet.
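To make the consolidation idea concrete, here is a sketch similar in spirit to PEFT's `target_modules`: a single list covers both the per-layer attention projections and the LM head, so a separate `apply_lora_to_output` flag is no longer needed. The `"lm_head"` name is an assumption for illustration:

```python
# Hypothetical consolidated config: one list instead of
# lora_attn_modules + apply_lora_to_output.
lora_modules = ["q_proj", "k_proj", "v_proj", "output_proj", "lm_head"]

def apply_lora_to_output(modules):
    """Derived from the list, rather than being a standalone config flag."""
    return "lm_head" in modules

print(apply_lora_to_output(lora_modules))  # True
```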
@ebsmothers I would prefer something along the lines of `quantized_modules: true` to mean quantize everything, and then `quantized_modules = [...]` if a user wants more control. This adds to the validation complexity, as I'd expect we would want to ensure that `quantized_modules` is a superset of `lora_modules`.
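A minimal sketch of the validation implied here, assuming `quantized_modules` may be either `True` ("quantize everything") or an explicit list that must cover every LoRA module so that the classic QLoRA behavior (an NF4 base weight under each adapter) is preserved:

```python
def validate_quantized_modules(quantized_modules, lora_modules):
    """Hypothetical check: list form must be a superset of lora_modules."""
    if quantized_modules is True:
        return  # quantize all linear layers; nothing to check
    missing = set(lora_modules) - set(quantized_modules)
    if missing:
        raise ValueError(
            "quantized_modules must be a superset of lora_modules; "
            f"missing: {sorted(missing)}"
        )

validate_quantized_modules(True, ["k_proj", "v_proj"])        # ok
validate_quantized_modules(["k_proj", "w1"], ["k_proj"])      # ok
# validate_quantized_modules(["k_proj"], ["k_proj", "v_proj"])  # raises
```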
@winglian does this mean `quantized_modules` would be something like `Union[bool, List[str]]`? At least in torchtune we haven't received a ton of requests for really granular quantization configuration. Based on that, my assumption is that the bool version of `quantized_modules` you gave would suffice for a first version. However, I'd also be interested in your perspective from the axolotl community: have you observed that users want the additional level of configurability provided by specifying a full list of modules to quantize?
Generally, users haven't asked for that level of granularity. Additionally, most users don't understand how to track down the module names, especially since they may differ from one model architecture to another.
So just to summarize our collective ideas: we want to modify the default behavior of `quantize_base` to quantize all linear layers, and not provide any additional granularity at this point in time.
As pointed out by @janeyx99, our `quantize_base` argument will only quantize the base model weights of linear layers with LoRA applied to them (see e.g. here in our Llama3 self-attention builder). But this is kind of an artificial constraint: there's no reason we can't use the same `to_nf4` API we use in LoRA to quantize other `nn.Linear`s and save memory. We can also define e.g. an `NF4Linear` class if we want (in fact we previously had such a class but ultimately didn't use it, see #465). We just need to figure out the right way to expose this. We could either change `quantize_base` to quantize all linear layers, add a `quantize_lora_layers_only` bool, or provide further per-linear configurability similar to what we currently have for our LoRA configs today (cf.). Personally I think the latter is overgeneralizing, but open to discussion here.

cc @joecummings, @winglian
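A stdlib-only sketch of the "quantize all linear layers" option. In real code this would walk `model.named_modules()` and apply torchao's `to_nf4` to each base weight; here `Linear`, `LoRALinear`, and `NF4Linear` are stand-in classes purely to illustrate the replacement logic:

```python
class Linear:
    """Stand-in for nn.Linear."""

class LoRALinear(Linear):
    """Stand-in; base weight is already NF4 when quantize_base=True."""

class NF4Linear(Linear):
    """Stand-in for a hypothetical NF4-weight wrapper (cf. #465)."""

def quantize_all_linears(named_modules):
    """Swap every plain Linear for NF4Linear; LoRA layers are left as-is."""
    out = {}
    for name, mod in named_modules.items():
        if isinstance(mod, (LoRALinear, NF4Linear)):
            out[name] = mod  # already quantized via the LoRA path
        elif isinstance(mod, Linear):
            out[name] = NF4Linear()  # newly quantized base weight
        else:
            out[name] = mod  # non-linear modules untouched
    return out

model = {"attn.q_proj": LoRALinear(), "mlp.w1": Linear(), "norm": object()}
quantized = quantize_all_linears(model)
print(type(quantized["mlp.w1"]).__name__)  # NF4Linear
```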