Open guokan987 opened 7 months ago
Hi, you can refer to this formula:
Thanks, authors, but I can't understand why the following equation is used for training: XW_0 + dropout(X)(m/norm(V+deltaV))*(V+deltaV). Its effect is unclear to me.
Note that (mag_norm_scale - 1) * (F.linear(x, transpose(weight, self.fan_in_fan_out))) must be included to properly apply dropout; otherwise, the outcome would be inaccurate. You can refer to https://github.com/huggingface/peft/pull/1474 where we discuss this.
code: result_dora = (mag_norm_scale - 1) * (F.linear(x, transpose(weight, self.fan_in_fan_out))) + mag_norm_scale * lora_B(lora_A(x)) * scaling
Question: What is the effect of (mag_norm_scale - 1) and mag_norm_scale? Also, because of the "mag_norm_scale - 1" factor, result_dora can't equal F.linear(x, transpose(weight, self.fan_in_fan_out)) at the initialization stage.
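One way to see what the two factors do is to check the algebra numerically. The following is only a minimal sketch in plain Python, not the PEFT code. It assumes that result_dora is added on top of the base layer output, that the norm is taken per output row of the merged weight, and that dropout is switched off. The helpers matvec, matmul, row_norms, and dora_forward are hypothetical names made up for this illustration. Under those assumptions, base + result_dora works out to mag_norm_scale * (W0 x + scaling * B A x), and at initialization (B = 0, magnitude = row norms of W0) mag_norm_scale is 1, so result_dora is zero rather than equal to the base output.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def row_norms(W):
    """L2 norm of each row (assumed here: one norm per output feature)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]

def dora_forward(x, W0, A, B, magnitude, scaling):
    """Eval-mode sketch (no dropout), split the same way as the quoted line."""
    BA = matmul(B, A)
    merged = [[w + scaling * d for w, d in zip(wr, dr)] for wr, dr in zip(W0, BA)]
    scale = [m / n for m, n in zip(magnitude, row_norms(merged))]  # mag_norm_scale
    base = matvec(W0, x)            # plays the role of F.linear(x, weight)
    lora = matvec(B, matvec(A, x))  # plays the role of lora_B(lora_A(x))
    result_dora = [(s - 1) * b + s * l * scaling
                   for s, b, l in zip(scale, base, lora)]
    output = [b + r for b, r in zip(base, result_dora)]              # base + correction
    direct = [s * (b + scaling * l) for s, b, l in zip(scale, base, lora)]
    return base, result_dora, output, direct

# Toy values chosen only for illustration.
x = [1.0, 2.0, -1.0]
W0 = [[3.0, 4.0, 0.0], [0.0, 5.0, 12.0]]   # out_features=2, in_features=3
A = [[0.5, -1.0, 2.0]]                      # rank 1
scaling = 2.0

# Initialization: B is zero and the magnitude equals the row norms of W0.
B_init = [[0.0], [0.0]]
base, result_dora, output, direct = dora_forward(
    x, W0, A, B_init, row_norms(W0), scaling)
print("init result_dora:", result_dora)
print("init output equals base:", output == base)

# After some hypothetical training step: B and magnitude have moved.
B_trained = [[0.3], [-0.2]]
_, rd_trained, out_trained, direct_trained = dora_forward(
    x, W0, A, B_trained, [4.0, 10.0], scaling)
print("trained output:", out_trained)
print("direct formula:", direct_trained)
```

So, if the assumptions above hold, the "- 1" is there because the unscaled base output is already part of the final result: (mag_norm_scale - 1) rescales the base term, and mag_norm_scale rescales the LoRA term.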