lucidrains / BS-RoFormer

Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
MIT License

Linear Attention temperature initialization #28

Closed. f0k closed this issue 4 months ago

f0k commented 6 months ago

First of all, thanks for all the good work (including in this repo)!

There's a potential bug in LinearAttention that I thought I should bring to your attention: in contrast to the original paper, you learn the logarithm of the temperature, but still initialize the parameter with ones instead of zeros. Since the parameter is exponentiated before use, the temperature starts out at Euler's number (e ≈ 2.718) rather than 1. Maybe it doesn't make much of a difference in practice, or even works better, but it looks like it may have been unintentional.
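To make the effect concrete, here is a minimal sketch in plain Python rather than the repo's actual PyTorch code. The helper names (`effective_temperature`, `softmax`) and the example scores are hypothetical, and it assumes the exponentiated parameter multiplies the similarity scores, which may not match how LinearAttention applies it.

```python
import math


def effective_temperature(log_temperature: float) -> float:
    """Temperature actually applied when the learned parameter is its logarithm."""
    return math.exp(log_temperature)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarity scores for one query against three keys.
scores = [0.5, 0.1, -0.3]

# Parameter initialized with ones: the effective temperature is e.
temp_ones = effective_temperature(1.0)
# Parameter initialized with zeros: the effective temperature is 1.
temp_zeros = effective_temperature(0.0)

# Assumption: the temperature scales the scores multiplicatively.
attn_ones = softmax([s * temp_ones for s in scores])
attn_zeros = softmax([s * temp_zeros for s in scores])

print(f"init=ones  -> temperature {temp_ones:.4f}, weights {attn_ones}")
print(f"init=zeros -> temperature {temp_zeros:.4f}, weights {attn_zeros}")
```

Under that assumption, the ones initialization starts training with a sharper attention distribution than the zeros initialization, which leaves the scores unscaled.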

lucidrains commented 4 months ago

@f0k thanks Jan! made the change