jzhang38 / EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.

Logits shift in loss computation #39

Open shivamag125 opened 1 month ago

shivamag125 commented 1 month ago

While computing the loss (L136), shouldn't the logits and targets be rolled to account for next-token prediction?

Similar to https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1092
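For reference, here is a minimal sketch of the shift the linked `modeling_llama.py` code performs inside the loss computation (the function name `shifted_causal_lm_loss` is just illustrative, not from either codebase): logits at position `t` score the token at position `t + 1`, so logits and labels are offset by one before the cross-entropy.

```python
import torch
import torch.nn.functional as F

def shifted_causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # Drop the last logit and the first label so position t predicts token t + 1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # positions marked -100 are excluded from the loss
    )
```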

shivamag125 commented 1 month ago

Edit: I see that you took care of it while preparing the targets.
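In other words, the shift can equivalently be applied once during data preparation instead of inside the loss. A rough sketch of that pattern (illustrative only, not EasyContext's exact code; `prepare_targets` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

def prepare_targets(input_ids: torch.Tensor) -> torch.Tensor:
    # Target at position t is the input token at position t + 1.
    targets = torch.roll(input_ids, shifts=-1, dims=-1)
    targets[..., -1] = -100  # no next token for the final position; mask it out
    return targets

def loss_with_preshifted_targets(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Targets are already aligned, so logits and targets compare position-by-position.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=-100,
    )
```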