kyegomez / AttentionIsOFFByOne

Implementation of "Attention Is Off By One" by Evan Miller
MIT License
179 stars · 9 forks

How to solve the overflow issue #2

Open ZGCTroy opened 1 year ago

ZGCTroy commented 1 year ago

When x_i is large, torch.exp(x_i) will overflow.

In your implementation, x_i = x_i - x_max. So should the softmax_one equation be exp(x_i - x_max) / (1 + sum_j exp(x_j - x_max))?
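
For reference: shifting by x_max rescales both terms of the denominator, so the +1 must become exp(-x_max), i.e. exp(x_i - x_max) / (exp(-x_max) + sum_j exp(x_j - x_max)). A minimal sketch of a numerically stable softmax_one in PyTorch (the function name and the clamp detail are illustrative, not necessarily the repo's implementation):

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Numerically stable softmax_one: exp(x_i) / (1 + sum_j exp(x_j)).

    Shifting by m = max(x) rescales numerator and denominator alike:
    exp(x_i - m) / (exp(-m) + sum_j exp(x_j - m)).
    """
    m = x.max(dim=dim, keepdim=True).values
    # Clamp at 0 so exp(-m) cannot itself overflow when every logit is very
    # negative; in that regime exp(x - m) underflows to 0 and the output
    # correctly goes to 0 (the "attend to nothing" case).
    m = torch.clamp(m, min=0.0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

# Example: large logits no longer overflow.
x = torch.tensor([[1e4, 1.0, -2.0]])
print(softmax_one(x))  # tensor([[1., 0., 0.]]) -- finite, no inf/nan
```

With plain softmax the max shift alone suffices because the +1 is absent; here the clamp keeps exp(-m) bounded while preserving the all-zero output when every logit is strongly negative.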
