Open fzyzcjy opened 3 weeks ago
Oh, the CE loss weighting is now fixed by the gradient accumulation bug fix! I.e. there are no more weighting factors at all - the correct loss is actually number 2: we sum and do not divide by the length of each sequence.
The FA2 packing method does need changes - specifically, the RoPE kernel must accept indices.
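For context, a rough sketch of the indices in question: per-sequence position IDs that restart at 0 at every packed-sequence boundary, rather than running 0..N-1 across the whole packed row (the helper name here is made up for illustration):

```python
def packed_position_ids(seq_lens):
    """Position IDs for a packed row: RoPE positions must restart at 0
    for every sequence instead of counting across the whole row."""
    ids = []
    for n in seq_lens:
        ids.extend(range(n))
    return ids

# Two sequences of lengths 7 and 2 packed into one row:
packed_position_ids([7, 2])  # [0, 1, 2, 3, 4, 5, 6, 0, 1]
```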
I see, thank you!
Hi, thanks for the lib! The new packing feature (https://huggingface.co/blog/packing-with-FA2) looks great and could further speed up training with this repo. However, if I understand correctly, there is a problem: the cross-entropy loss will be different.
For example, suppose we have two sequences `[a,b,c,d,e,f,g]` and `[x,y]`. Then, the original way (mean over each sequence, then mean over the batch) will give:

    L_orig = 1/2 * [ (l_a + ... + l_g) / 7 + (l_x + l_y) / 2 ]

while the new one, packed as `[a,b,c,d,e,f,g,x,y]`, will be:

    L_packed = (l_a + ... + l_g + l_x + l_y) / 9

As can be seen, for example, "y" is originally weighted 1/2 * 1/2 = 1/4, but now weighted 1/9.
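A tiny numeric sketch of the difference, using hypothetical per-token loss values (everything zero except the loss on "y", so the result is exactly y's effective weight):

```python
def original_loss(seq_losses):
    # mean over each sequence, then mean over the batch
    return sum(sum(s) / len(s) for s in seq_losses) / len(seq_losses)

def packed_loss(seq_losses):
    # one mean over all packed tokens
    flat = [t for s in seq_losses for t in s]
    return sum(flat) / len(flat)

# Loss of 1.0 on "y" only; sequences of length 7 and 2:
seqs = [[0.0] * 7, [0.0, 1.0]]
original_loss(seqs)  # 0.25 (= 1/2 * 1/2)
packed_loss(seqs)    # 0.111... (= 1/9)
```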
Thus, the proposal is to add support for correctly weighting every token. Then we can happily enable this feature (by default), and unsloth can be even faster by wasting fewer tokens!
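A minimal sketch of the proposed weighting (assuming the mean-per-sequence convention; the function names are made up): give each packed token the weight 1 / (num_sequences * its_own_sequence_length), so the weighted sum over the packed row reproduces the unpacked loss:

```python
def token_weights(seq_lens):
    # weight of every token = 1 / (num_sequences * own_sequence_length)
    n = len(seq_lens)
    w = []
    for length in seq_lens:
        w.extend([1.0 / (n * length)] * length)
    return w

def weighted_packed_loss(flat_losses, seq_lens):
    # weighted sum over the packed row, no extra normalization
    return sum(l * w for l, w in zip(flat_losses, token_weights(seq_lens)))

# Loss of 1.0 on "y" only, sequences of length 7 and 2:
flat = [0.0] * 7 + [0.0, 1.0]
weighted_packed_loss(flat, [7, 2])  # 0.25, matching the unpacked loss
```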