lucidrains / FLASH-pytorch

Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"
MIT License
344 stars, 24 forks

Laplace Activation Function Implementation #7

Closed. boweny-cerebras closed this issue 1 year ago

boweny-cerebras commented 1 year ago

The implementation of the Laplace activation seems to deviate from what the paper describes. I think the code should use std = 1 / math.sqrt(4 * math.pi) instead of std = math.sqrt(0.25 * math.pi), since the former is the value that makes the Laplace function approximate relu^2.
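To make the difference concrete, here is a minimal sketch that compares the two std values against relu^2. It assumes the Laplace function is the Gaussian CDF 0.5 * (1 + erf((x - mu) / (std * sqrt(2)))) with mu = sqrt(0.5), which this thread does not spell out. The function and variable names are hypothetical and are not taken from the repository.

```python
import math

# Assumption: mu = sqrt(0.5), the point where relu^2 equals 0.5.
MU = math.sqrt(0.5)


def laplace(x, std, mu=MU):
    """Gaussian-CDF form of the Laplace activation (assumed form)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (std * math.sqrt(2.0))))


def relu_squared(x):
    """Reference function that the activation is meant to approximate."""
    return max(x, 0.0) ** 2


def slope_at_mu(std):
    """Analytic derivative of the Gaussian CDF evaluated at its mean."""
    return 1.0 / (std * math.sqrt(2.0 * math.pi))


std_proposed = 1 / math.sqrt(4 * math.pi)  # about 0.2821
std_reported = math.sqrt(0.25 * math.pi)   # about 0.8862

# Both choices pass through 0.5 at x = MU, so compare the slopes there.
# relu^2 has slope 2 * MU = sqrt(2) at that point.
for name, std in [("proposed", std_proposed), ("reported", std_reported)]:
    print(f"{name}: std={std:.4f}  value at mu={laplace(MU, std):.4f}  "
          f"slope at mu={slope_at_mu(std):.4f}  relu^2 slope={2 * MU:.4f}")
```

Under these assumptions, both values give 0.5 at x = mu, but only std = 1 / math.sqrt(4 * math.pi) also matches the slope of relu^2 there, which is the sense in which it is the better approximation.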

lucidrains commented 1 year ago

@boweny-cerebras thank you for reporting this! :man_facepalming: