princeton-nlp / CEPE

[ACL 2024] Long-Context Language Modeling with Parallel Encodings
https://arxiv.org/abs/2402.16617
MIT License

Seeking help with the loss curve #2

Closed: 311dada closed this issue 1 month ago

311dada commented 3 months ago

Congratulations on your excellent work. Intuitively, introducing new parameters for cross-attention may lead to a high loss. Could you please share your loss curve? Thanks a lot!

howard-yen commented 2 months ago

Hi, apologies for the late reply. You can check this wandb link for the evaluation loss curve during training. Your intuition is correct --- we also found that simply introducing new parameters in the form of cross-attention results in higher loss and training instability. Therefore, we carefully initialize the cross-attention weights from the self-attention weights, and we also use a warmup stage (described in Section 2.3 and Appendix A.2 of the paper).
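For illustration only, here is a minimal sketch of that kind of initialization, assuming Llama-style projection names (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and matching shapes; the module and function names are hypothetical and this is not the repository's actual code:

```python
import torch
import torch.nn as nn

class CrossAttentionProjections(nn.Module):
    """Hypothetical container for the newly added cross-attention projections,
    shaped to match the decoder's self-attention projections."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

def init_cross_attn_from_self_attn(cross_attn, self_attn) -> None:
    """Copy the pretrained self-attention projection weights into the new
    cross-attention projections, so the added parameters start from a
    sensible point instead of a random one (which we found destabilizes
    training and raises the loss)."""
    with torch.no_grad():
        cross_attn.q_proj.weight.copy_(self_attn.q_proj.weight)
        cross_attn.k_proj.weight.copy_(self_attn.k_proj.weight)
        cross_attn.v_proj.weight.copy_(self_attn.v_proj.weight)
        cross_attn.o_proj.weight.copy_(self_attn.o_proj.weight)
```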

Although I do not have the loss curves for the warmup training stage plotted on wandb, they look quite similar to the one linked. The main difference is that the loss starts out higher, but because the learning objective is easy, the loss quickly converges to near 0 after just a couple thousand steps. Please let me know if you have any other questions!
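As a rough sketch of why the warmup objective is easy: the idea (per the paper's warmup description) is that the encoder is given a copy of the tokens the decoder already sees, chunked into fixed-size segments for parallel encoding, so the cross-attention only has to learn to retrieve information that is trivially available. The function name and chunking details below are illustrative assumptions, not the repository's implementation:

```python
import torch

def build_warmup_batch(input_ids: torch.LongTensor, chunk_size: int = 256):
    """Illustrative warmup batch: the decoder keeps the standard next-token
    objective, while the encoder receives a copy of the same tokens reshaped
    into fixed-size chunks. Since the needed information is already present
    in the encoder input, the loss drops to near zero quickly."""
    batch_size, seq_len = input_ids.shape
    num_chunks = seq_len // chunk_size
    decoder_input = input_ids  # (batch, seq_len): standard LM input/targets
    encoder_input = input_ids[:, : num_chunks * chunk_size].view(
        batch_size, num_chunks, chunk_size
    )  # (batch, num_chunks, chunk_size): parallel-encoded copy of the context
    return decoder_input, encoder_input
```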

311dada commented 1 month ago

Thanks a lot!