szagoruyko / attention-transfer

Improving Convolutional Networks via Attention Transfer (ICLR 2017)
https://arxiv.org/abs/1612.03928

Setting of β #32

Open tangbohu opened 5 years ago

tangbohu commented 5 years ago

Hi.

In the paper, the authors said "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer. "

But I am still confused. What does 10^3 mean, and how is the 0.1 obtained?

d-li14 commented 5 years ago

@tangbohu I assume that β is 10^3 / batch_size / (feature_map_size)^2; in practice this division happens inside the averaging function in the code. The batch size is 128 by default, and the feature map size varies over 32x32, 16x16, and 8x8, so the expression above comes out around 0.1. This is just my own conjecture from the code.
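
To make that arithmetic concrete, here is a minimal sketch of the conjectured effective weight per layer, assuming the CIFAR/WRN defaults (batch size 128 and attention maps of 32x32, 16x16, and 8x8). Under this conjecture the division by batch size and map size comes from averaging the squared attention-map differences, rather than from an explicit per-layer β:

```python
# Conjectured effective weight per layer:
#   beta_eff = 10^3 / (batch_size * number of elements in the attention map)
# The batch size of 128 and the 32x32 / 16x16 / 8x8 map sizes are assumed
# CIFAR WRN defaults, not values read directly from the code.

batch_size = 128
beta = 1e3  # the 10^3 constant quoted from the paper

for side in (32, 16, 8):
    n_elements = side * side  # elements in one attention map
    beta_eff = beta / (batch_size * n_elements)
    print(f"{side}x{side} map: beta_eff ~= {beta_eff:.4f}")

# Prints roughly 0.0076, 0.0305 and 0.1221, i.e. values on the order of
# 0.01-0.1, consistent with the paper's remark that beta "varies about 0.1".
```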