jzhang38 / EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.

Logits shift in loss computation #39

Open shivamag125 opened 1 month ago

shivamag125 commented 1 month ago

While computing the loss (L136), shouldn't the logits and targets be rolled to account for next-token prediction?

Similar to https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1092
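For reference, here is a minimal sketch of the shift the linked `modeling_llama.py` code performs inside the loss computation (the function name `shifted_causal_lm_loss` is just illustrative, not from either codebase): logits at position `t` score the token at position `t + 1`, so logits and labels are offset by one before the cross-entropy.

```python
import torch
import torch.nn.functional as F

def shifted_causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # Drop the last logit and the first label so position t predicts token t + 1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # positions marked -100 are excluded from the loss
    )
```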

shivamag125 commented 1 month ago

Edit: I see that you took care of it while preparing the targets.
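In other words, the shift can equivalently be applied once during data preparation instead of inside the loss. A rough sketch of that pattern (illustrative only, not EasyContext's exact code; `prepare_targets` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

def prepare_targets(input_ids: torch.Tensor) -> torch.Tensor:
    # Target at position t is the input token at position t + 1.
    targets = torch.roll(input_ids, shifts=-1, dims=-1)
    targets[..., -1] = -100  # no next token for the final position; mask it out
    return targets

def loss_with_preshifted_targets(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Targets are already aligned, so logits and targets compare position-by-position.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=-100,
    )
```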