unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Cross entropy for packing #1216

Open fzyzcjy opened 3 weeks ago

fzyzcjy commented 3 weeks ago

Hi, thanks for the lib! The new packing feature described in https://huggingface.co/blog/packing-with-FA2 looks great and can further speed up training with this repo. However, if I understand correctly, it has a problem: the cross entropy loss will be different.

For example, suppose we have two sequences [a,b,c,d,e,f,g] and [x,y]. Then, the original way will give:

loss = 1/2( 1/7(loss(a) + loss(b) + ... + loss(g)) + 1/2(loss(x) + loss(y)) )

while the new, packed sequence [a,b,c,d,e,f,g,x,y] will give:

loss = 1/9(loss(a) + ... + loss(y))

As can be seen, for example, "x" is originally weighted 1/4 (1/2 × 1/2), but is now weighted 1/9.
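To make the difference concrete, here is a quick sketch with made-up per-token loss values (the numbers are hypothetical; only the reductions matter):

```python
# Illustrative only: per-token losses are invented; the point is the weighting.
seq_a = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5, "e": 0.5, "f": 0.5, "g": 0.5}  # 7 tokens
seq_b = {"x": 2.0, "y": 2.0}                                                    # 2 tokens

# Original (unpacked) reduction: mean per sequence, then mean over sequences.
unpacked = 0.5 * (sum(seq_a.values()) / len(seq_a) + sum(seq_b.values()) / len(seq_b))

# Packed reduction: one flat mean over all 9 tokens.
packed = (sum(seq_a.values()) + sum(seq_b.values())) / (len(seq_a) + len(seq_b))

print(unpacked)  # 1.25      -> each token of seq_b carries weight 1/2 * 1/2 = 1/4
print(packed)    # ~0.8333   -> every token carries weight 1/9
```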

Thus, the proposal is to add support for correctly weighting every token under packing. Then we can happily enable this feature (by default), and Unsloth can be even faster by wasting fewer tokens!

danielhanchen commented 3 weeks ago

Oh, the CE loss weighting is now fixed as part of the gradient accumulation bug fix! I.e. there are no more per-sequence weighting factors at all: the correct loss is actually the second one, where we sum the token losses rather than divide by each sequence's length.
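A minimal sketch of that reduction, assuming the loss is the sum of per-token cross entropies divided by the total count of unmasked tokens (illustrative only, not Unsloth's actual kernel):

```python
import torch
import torch.nn.functional as F

def ce_loss_sum_over_tokens(logits, labels, ignore_index=-100):
    # Sum per-token losses, then divide by the number of non-ignored tokens.
    # Because there is no per-sequence mean, packing [a..g] and [x, y] into one
    # row gives the same loss as keeping them in separate rows.
    total = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="sum",
    )
    n_tokens = (labels != ignore_index).sum()
    return total / n_tokens
```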

The FA2 packing method does need changes, though: specifically, the RoPE kernel must accept position indices.
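For context, a hypothetical illustration of the kind of indices this implies: with packing, the position ids fed to RoPE restart at 0 at every sequence boundary instead of running 0..N-1 over the whole packed row (assumed helper, not the actual kernel interface):

```python
import torch

def packed_position_ids(seq_lens):
    # Position ids restart at 0 for each packed sequence.
    return torch.cat([torch.arange(n) for n in seq_lens])

print(packed_position_ids([7, 2]))  # tensor([0, 1, 2, 3, 4, 5, 6, 0, 1])
```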

fzyzcjy commented 3 weeks ago

I see, thank you!