Hi,

First of all, thanks for your work on fixing gradient accumulation! I have a question about the implementation in unsloth-zoo here. In the blog post https://unsloth.ai/blog/gradient you say:
> This means naively averaging over each gradient accumulation step is wrong, but instead we must derive the denominator beforehand.
But checking your code, I can see that you simply add up the losses, while the denominator is commented out: https://github.com/unslothai/unsloth-zoo/blob/7b0048e53a6239bdad76cad66bf2490f6a2f8a9b/unsloth_zoo/training_utils.py#L268-L270
Shouldn't the loss be multiplied by the denominator here to match the "After - Unsloth fix" graph?
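To make my question concrete, here is a toy sketch (plain Python, hypothetical function names, not your actual code) of the difference the blog post describes: averaging each micro-batch's mean loss over-weights tokens from short batches, whereas summing all per-token losses and dividing by a denominator derived beforehand gives the true per-token mean.

```python
def naive_accumulated_loss(batches):
    # Naive: average the per-batch mean losses over the accumulation steps.
    # Each batch gets equal weight regardless of how many tokens it holds.
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def fixed_accumulated_loss(batches):
    # Fix described in the blog post: sum every per-token loss and divide
    # by the total (non-padded) token count, derived before the division.
    total = sum(sum(b) for b in batches)
    n_tokens = sum(len(b) for b in batches)
    return total / n_tokens

# Two micro-batches with different numbers of non-padded tokens.
batches = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0]]
print(naive_accumulated_loss(batches))  # 2.0  (biased toward the short batch)
print(fixed_accumulated_loss(batches))  # ~1.667  (true mean over all 6 tokens)
```

The two results only agree when every micro-batch has the same token count, which is exactly why the denominator matters here.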