titu1994 / keras-squeeze-excite-network

Implementation of Squeeze and Excitation Networks in Keras
MIT License

Intuition on SE Block #7

Closed: pGit1 closed this issue 6 years ago

pGit1 commented 6 years ago

Any intuition on why a sigmoid is used instead of a softmax layer in the SE block?

titu1994 commented 6 years ago

Because it's an excitation mechanism, not a Softmax attention mechanism.

Softmax forces all the channel weights to sum to 1. That means only a few filters receive non-negligible weights, while the remaining majority are pushed toward sparsity (values close to 0). This is redundant and a waste of memory.

Excitation instead scales each filter independently by a value in the 0-1 range. Some filters will be gated close to 0, but most will fall in roughly the 0.1-1.0 range depending on what the block learns, so there is no forced sparsity and every filter can still contribute to the next layer.
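
For concreteness, a minimal Keras sketch of the sigmoid gating described above (an illustrative re-implementation, not the repo's exact code; the function name and `ratio` argument are placeholders):

```python
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Sketch of squeeze-and-excitation gating (illustrative only).

    Squeeze: global average pool gives one descriptor per channel.
    Excite: two dense layers ending in a sigmoid, so each channel gets an
    independent gate in (0, 1) rather than a softmax-normalized weight.
    """
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # (B, C)
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)   # independent 0-1 gates
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                      # rescale each channel
```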

pGit1 commented 6 years ago

Awesome, this makes perfect sense! Thanks for the detailed explanation.

Do you think excitation mechanisms can supplant attention mechanisms, or are their uses mutually exclusive? Also, I tested switching the sigmoid layer to softmax and performance completely died.


titu1994 commented 6 years ago

Sigmoid cannot be swapped out for softmax in this block, as you have noticed.

Softmax is appropriate for dot-product attention, not for an elementwise product. I have not heard of any model that incorporates both dot-product attention and elementwise gating at the same time.
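
A quick numerical sketch of why the swap hurts (illustrative only, assuming a layer with 256 channels): softmax normalizes across channels, so the average gate shrinks to 1/C and almost every channel is crushed, whereas independent sigmoid gates average around 0.5.

```python
import numpy as np

c = 256                                           # number of channels (assumed)
logits = np.random.randn(c)

sigmoid = 1 / (1 + np.exp(-logits))               # independent gates
softmax = np.exp(logits) / np.exp(logits).sum()   # gates forced to sum to 1

print(sigmoid.mean())   # ~0.5: channels keep most of their magnitude
print(softmax.mean())   # 1/256 ~ 0.004: nearly all channels are suppressed
```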