There may be a problem if the loss and the gradient are computed with different samples. It might be better to use a mask to store the selected samples, as the dropout layer does.
Besides, it is a little confusing that the loss computed in the forward pass is not used in the backward computation.
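To make the suggestion concrete, here is a minimal sketch of the dropout-style idea in plain C++ (not the actual Caffe/CUDA layer; the struct, loss, and names are made up for illustration): sample the indices once in `Forward`, keep them in a member buffer, and reuse exactly those indices in `Backward`.

```cpp
// Simplified illustration only: store the sampled indices in Forward and
// reuse the same indices in Backward, the way the dropout layer keeps its
// random mask between the two passes.
#include <cstddef>
#include <random>
#include <vector>

struct SampledPairLayer {
  std::vector<int> mask_;  // indices sampled in Forward, reused in Backward
  std::mt19937 rng_{0};

  // Forward: draw num_pairs sample indices, remember them, and use them
  // to compute a toy loss.
  double Forward(const std::vector<double>& bottom, std::size_t num_pairs) {
    std::uniform_int_distribution<int> pick(0, static_cast<int>(bottom.size()) - 1);
    mask_.resize(num_pairs);
    double loss = 0.0;
    for (std::size_t i = 0; i < num_pairs; ++i) {
      mask_[i] = pick(rng_);                        // record the selection
      loss += bottom[mask_[i]] * bottom[mask_[i]];  // toy per-sample loss
    }
    return loss / num_pairs;
  }

  // Backward: no re-sampling; the gradient touches exactly the samples
  // that produced the forward loss.
  void Backward(const std::vector<double>& bottom, std::vector<double>* diff) {
    diff->assign(bottom.size(), 0.0);
    for (int idx : mask_) {
      (*diff)[idx] += 2.0 * bottom[idx] / mask_.size();  // d(toy loss)/d(bottom)
    }
  }
};
```

With this pattern the backward pass cannot drift from the forward loss, at the cost of keeping the index buffer alive between the two passes.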
I see the point. The code is correct.
It seems that the loss computed in forward is only a reference value and is not involved in the backward computation, so it is fine to compute the gradients with separately drawn samples in backward.
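One way to see why this can be acceptable (my reading, not something stated in the code): if the index set used in backward is drawn independently of the one used in forward, the resulting gradient is still an unbiased stochastic gradient of the expected loss; it simply does not correspond to the particular loss value logged in forward. Writing $\hat{L}_S(\theta)$ for the loss estimate on a random index set $S$,

$$
\mathbb{E}_{S'}\!\left[\nabla_\theta \hat{L}_{S'}(\theta)\right]
  = \nabla_\theta\, \mathbb{E}_{S'}\!\left[\hat{L}_{S'}(\theta)\right],
$$

since $S'$ is a finite random choice whose distribution does not depend on $\theta$.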
In the `mmd` layer, random sampling is used in both the forward and the backward computation:
https://github.com/thuml/Xlearn/blob/master/caffe/src/caffe/layers/mmd_layer.cu#L85-91
https://github.com/thuml/Xlearn/blob/master/caffe/src/caffe/layers/mmd_layer.cu#L144-150