Closed · Leechung closed this issue 5 years ago
Scaling happens in multihead_attention (https://github.com/tensorflow/tensor2tensor/blob/af42d543c2f24a0143b2483db93ac931c54146b9/tensor2tensor/layers/common_attention.py#L3437). dot_product_attention is aptly named, since it does not apply any scaling itself :)
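For anyone else tracing this, here is a minimal NumPy sketch (not the tensor2tensor code itself; the function names are mine) of the split the link above describes: the attention layer is unscaled, and the caller rescales the query by 1/sqrt(depth) beforehand, which is equivalent to dividing the logits by sqrt(d_k).

```python
import numpy as np

def dot_product_attention(q, k, v):
    # Plain dot-product attention: softmax(q @ k^T) @ v, with no scaling,
    # mirroring what the dot_product_attention layer computes on its own.
    logits = np.matmul(q, np.swapaxes(k, -1, -2))      # [batch, len_q, len_k]
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return np.matmul(weights, v)                       # [batch, len_q, depth_v]

def scaled_dot_product_attention(q, k, v):
    # The caller (multihead_attention in the repo) applies the 1/sqrt(d_k)
    # factor from the paper by rescaling the query before the unscaled call.
    depth = q.shape[-1]
    return dot_product_attention(q * depth ** -0.5, k, v)
```

Scaling the query before the matmul and scaling the logits after it give the same result, which is why the 1/sqrt(d_k) factor can be pulled out of dot_product_attention and left to the caller.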
ok, got it, thanks.
Hi, I have been looking at the code in layers.common_attention.dot_product_attention, and it seems to be regular dot-product attention without scaling (there is no other attention function that implements scaled dot-product attention, or did I miss it somewhere?). This seems to differ from section 3.2.1, Scaled Dot-Product Attention, of the original paper. Is there a reason why this was not implemented (or was it removed as newer versions were developed)?
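For reference, the formula from section 3.2.1 of the paper that the question refers to is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where d_k is the key depth; per the answer above, dot_product_attention computes the softmax(QK^T)V part and the 1/sqrt(d_k) factor is applied by multihead_attention before it is called.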