ivorelectra opened this issue 1 month ago
Hi,
Thanks for your interest in our work.
I tested Flash Attention with the same training settings as normal attention, and I do observe a performance drop with Flash Attention.
I would recommend using the latest version of PyTorch, where Flash Attention is integrated into the attention layer.
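For anyone unsure what that looks like in practice, here is a minimal sketch (not from this repo, and with placeholder tensor shapes) of calling PyTorch's built-in `scaled_dot_product_attention`, which dispatches to the Flash Attention kernel when the inputs allow it (recent PyTorch, CUDA device, fp16/bf16 tensors):

```python
import torch
import torch.nn.functional as F

# Placeholder shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Optionally force the Flash Attention backend to confirm it is actually used;
# if the inputs are not eligible, this context will raise instead of silently
# falling back to the math implementation.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)

print(out.shape)  # torch.Size([2, 8, 128, 64])
```

This is only an illustration of the PyTorch API, not the exact attention module used in this repository's training code.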
Hello, and thank you for your great work!
I have a question regarding the configuration when enabling Flash Attention. Specifically, should settings like the learning rate or batch size in the config file be adjusted when Flash Attention is used?
Additionally, I have noticed that when Flash Attention is enabled, I occasionally observe a grid-like pattern during training. I am curious whether you know the reason for this phenomenon. This grid-like artifact also appears under different parameter configurations.
I appreciate your insights and look forward to your response!