Closed: pGit1 closed this issue 6 years ago.
Because it's an excitation mechanism, not a Softmax attention mechanism.
Softmax ensures everything sums to 1. This means that only a few of the filters will be non-sparse, while the remaining majority will become sparse (values close to 0). This is redundant and a waste of memory.
Excitation uses a sigmoid to scale each filter independently in the 0-1 range. Therefore, some filters will be scaled close to 0, but most will lie in the 0.1-1.0 range of excitation, depending on what the block learns, so there is no enforced sparsity and all filters can contribute to the next layer.
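To make this concrete, here is a minimal sketch of the squeeze-and-excite pattern described above, assuming a tf.keras setup; the function name se_block and the reduction ratio of 16 are illustrative and not necessarily the exact API of this repo.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    # x: feature map of shape (batch, H, W, C)
    channels = x.shape[-1]
    # Squeeze: global average pool to a per-channel descriptor
    s = layers.GlobalAveragePooling2D()(x)                # (batch, C)
    # Excitation: bottleneck MLP ending in a sigmoid, so each channel
    # gets an independent gate in (0, 1) -- no sum-to-1 constraint
    s = layers.Dense(channels // ratio, activation='relu')(s)
    s = layers.Dense(channels, activation='sigmoid')(s)   # sigmoid, not softmax
    s = layers.Reshape((1, 1, channels))(s)
    # Scale: elementwise multiply each channel of x by its gate
    return layers.Multiply()([x, s])

# Example usage (shapes are illustrative):
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = se_block(inputs)
```

Replacing the sigmoid above with a softmax would force the channel gates to compete and sum to 1, which is exactly the sparsity problem described in this thread.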
Awesome, this makes perfect sense! Thanks for the detailed explanation.
Do you think excitation mechanisms can supplant attention mechanisms, or are their uses mutually exclusive? Also, I tested switching the sigmoid layer to softmax and performance completely died.
Sigmoid cannot be swapped out for softmax in this block, as you have noticed.
Softmax is appropriate for dot-product attention, not for an elementwise product. I have not heard of any model that incorporates both dot-product attention and elementwise gating at the same time.
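For intuition, a rough NumPy sketch of the contrast follows; all names, shapes, and data here are illustrative and not part of this repo.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Dot-product attention: scores compete, softmax makes each row sum to 1,
# and the output is a weighted average over the value vectors.
q, k, v = np.random.randn(4, 8), np.random.randn(10, 8), np.random.randn(10, 8)
weights = softmax(q @ k.T / np.sqrt(8))   # (4, 10), rows sum to 1
attended = weights @ v                    # (4, 8)

# Elementwise gating (as in the SE block): each channel is scaled
# independently in (0, 1); the gates do not compete or sum to 1.
features = np.random.randn(4, 16)
gates = sigmoid(np.random.randn(4, 16))   # independent per-channel gates
gated = features * gates                  # (4, 16)
```

The softmax weights form a distribution used to average values, while the sigmoid gates simply rescale each feature independently, which is why the two activations are not interchangeable.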
Any intuition on why a sigmoid is used instead of a softmax layer in the SE block?