Closed · Leechung closed this issue 5 years ago
Scaling happens in multihead_attention (https://github.com/tensorflow/tensor2tensor/blob/af42d543c2f24a0143b2483db93ac931c54146b9/tensor2tensor/layers/common_attention.py#L3437). dot_product_attention is aptly named, since it does not apply any scaling itself :)
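For anyone else tracing this, here is a minimal NumPy sketch (not the tensor2tensor code itself; the function names are mine) of the split the link above describes: the attention layer is unscaled, and the caller rescales the query by 1/sqrt(depth) beforehand, which is equivalent to dividing the logits by sqrt(d_k).

```python
import numpy as np

def dot_product_attention(q, k, v):
    # Plain dot-product attention: softmax(q @ k^T) @ v, with no scaling,
    # mirroring what the dot_product_attention layer computes on its own.
    logits = np.matmul(q, np.swapaxes(k, -1, -2))      # [batch, len_q, len_k]
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return np.matmul(weights, v)                       # [batch, len_q, depth_v]

def scaled_dot_product_attention(q, k, v):
    # The caller (multihead_attention in the repo) applies the 1/sqrt(d_k)
    # factor from the paper by rescaling the query before the unscaled call.
    depth = q.shape[-1]
    return dot_product_attention(q * depth ** -0.5, k, v)
```

Scaling the query before the matmul and scaling the logits after it give the same result, which is why the 1/sqrt(d_k) factor can be pulled out of dot_product_attention and left to the caller.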
ok, got it, thanks.
Hi, I have been looking at the code in layers.common_attention.dot_product_attention, and it seems to be regular dot-product attention without scaling (there is no other attention function that implements scaled dot-product attention, or did I miss it somewhere?). This seems to differ from section 3.2.1, Scaled Dot-Product Attention, of the original paper. Is there a reason why this was not implemented (or was it removed as newer versions were developed)?
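For reference, the formula from section 3.2.1 of the paper that the question refers to is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where d_k is the key depth; per the answer above, dot_product_attention computes the softmax(QK^T)V part and the 1/sqrt(d_k) factor is applied by multihead_attention before it is called.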