Hi Shai,
Thanks for your interest.
Our CGNL method is based on the trilinear formulation, so we choose the dot-product kernel for the functions θ and φ.
Section 3.2 of the original NL paper describes four types of kernel functions: Gaussian, embedded Gaussian, dot product, and concatenation. The embedded Gaussian kernel uses a softmax to activate the self-attention matrix. The dot-product kernel does not need a softmax for normalization; it instead uses a constant factor C(x) (see Eq. (4) in the NL paper). So in both the NL code and our CGNL code, applying this normalization is a choice.
In our code, we normalize the result by the number of positions in the feature x, the same as in the NL code.
In fact, we found that on the CUB-200 dataset the top-1 accuracy without scale normalization for the dot-product kernel is slightly better than with it. So by default the scale-norm flag is False in our code for all experiments, but whether to use it depends on the dataset and the task.
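To make that choice concrete, here is a minimal sketch (my own, not the repository's code) of a dot-product kernel with an optional scale normalization by the number of positions; the function name, the `use_scale_norm` flag, and the (B, C, N) tensor layout are assumptions for illustration only:

```python
import torch

def dot_product_attention(theta, phi, g, use_scale_norm=False):
    # theta, phi, g: flattened feature maps of shape (B, C, N),
    # where N = H * W is the number of spatial positions.
    att = torch.bmm(theta.transpose(1, 2), phi)   # (B, N, N), att[i, j] = theta_i . phi_j
    if use_scale_norm:
        # constant factor C(x) = number of positions, as in Eq. (4) of the NL paper
        att = att / att.size(-1)
    # aggregate the values: out[:, :, i] = sum_j att[i, j] * g[:, :, j]
    out = torch.bmm(g, att.transpose(1, 2))       # (B, C, N)
    return out
```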
We want to keep the high-order feature spaces in the Taylor expansion to enrich the feature representation, so the L2 norm is not applied here.
Under that precondition, β should be calculated as β = exp(−γ(∥θ∥² + ∥φ∥²)). But in our experiments we found that training becomes very difficult, so we simplify the implementation to ease the gradient computation by calculating β = exp(−2γ).
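For intuition, here is a rough sketch of how the simplified coefficient β = exp(−2γ) enters a truncated Taylor expansion of the RBF kernel exp(2γ·θᵀφ). This is not the repository's compact implementation (which avoids forming the full attention matrix); the function name, shapes, default γ, and expansion order are assumptions:

```python
import math
import torch

def rbf_taylor_attention(theta, phi, gamma=1e-4, order=3):
    # theta, phi: flattened feature maps of shape (B, C, N)
    # Simplified prefactor beta = exp(-2*gamma), standing in for
    # exp(-gamma * (||theta||^2 + ||phi||^2)) under a unit-norm assumption.
    beta = math.exp(-2.0 * gamma)
    dot = torch.bmm(theta.transpose(1, 2), phi)   # (B, N, N), theta_i . phi_j
    att = torch.zeros_like(dot)
    for p in range(order + 1):
        # p-th Taylor term of exp(2*gamma * x): (2*gamma)^p / p! * x^p
        att = att + (2.0 * gamma) ** p / math.factorial(p) * dot ** p
    # approximated kernel (attention) matrix
    return beta * att
```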
@KaiyuYue thank you for the detailed answer.
Hi, I find your work very interesting. However, I have two questions regarding normalization:
In the original non-local neural networks work, the product of phi and theta is normalized BEFORE it multiplies g to produce the output (in their work it is done using a softmax layer). I do not see any such normalization in your work - why?
Your Taylor expansion is based on the assumption that both theta and phi are of unit L2 norm. I do not see this enforced in your code - what have I missed?
Thanks,