Open · bobbiesbob opened 6 years ago
No, the code is correct; there are two variables, `mask` and `mask_multiplier`, that have different semantics for 0 and 1.
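For illustration, a minimal sketch (hypothetical tensor and shapes, not the repo's code) of the two conventions side by side: `mask` marks possible hands with 1, while the quantity derived from it counts the impossible hands:

```lua
require 'torch'

-- hypothetical 1x4 mask: 1 = possible hand, 0 = impossible hand
local mask = torch.Tensor{{1, 1, 1, 0}}
local feature_size = mask:size(2)

local possible   = mask:sum()               -- 3 possible hands
local impossible = feature_size - possible  -- 1; the opposite 0/1 sense
print(possible, impossible)
```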
I got a question about this too, so to clarify:

- `mask_sum` has the number of possible hands, since it's the sum over columns of `mask`.
- `mask_multiplier = (feature_size - mask_sum) / feature_size` is then the number of impossible hands divided by the number of total hands.
- The loss gradients, `dloss_doutput`, are divided by `mask_multiplier`, so the adjusted gradient is the original gradient multiplied by the number of total hands and divided by the number of impossible hands.

Why is the gradient not divided by the number of possible hands?
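To make the arithmetic concrete, a small worked example with hypothetical numbers (plain Lua, following the formulas above):

```lua
local feature_size = 6    -- total hands
local mask_sum     = 4    -- possible hands (sum over columns of mask)

-- fraction of impossible hands
local mask_multiplier = (feature_size - mask_sum) / feature_size  -- 2/6

-- gradients are divided by mask_multiplier, i.e. scaled by total/impossible
local grad          = 0.5
local adjusted_grad = grad / mask_multiplier  -- 0.5 * 6/2 = 1.5
print(mask_multiplier, adjusted_grad)
```

Note that the adjustment blows up as fewer hands are impossible, which is what prompts the question above.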
@dmorrill10 I have exactly the same question as you, did you figure it out?
@JaysenStark no, not yet.
@dmorrill10 @lifrordi I think the correct loss should be `loss = avg(sum(|pred_i - actual_i|) / mask_i)`, where `mask_i` is the number of possible hands in sample i. That is, each sample should compute its own average loss, and then the batch loss is the average of those per-sample losses, as sketched below.
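A minimal sketch of that proposed per-sample normalization in Torch (hypothetical tensors and shapes, just to illustrate the suggestion):

```lua
require 'torch'

-- hypothetical batch: 2 samples, 4 hands each; mask is 1 for possible hands
local pred   = torch.Tensor{{0.2, 0.4, 0.0, 0.0}, {0.1, 0.3, 0.5, 0.0}}
local actual = torch.Tensor{{0.1, 0.5, 0.0, 0.0}, {0.2, 0.2, 0.4, 0.0}}
local mask   = torch.Tensor{{1, 1, 0, 0}, {1, 1, 1, 0}}

-- per-sample sum of masked absolute errors, divided by that sample's
-- number of possible hands, then averaged over the batch
local abs_err    = torch.abs(pred - actual):cmul(mask)
local per_sample = abs_err:sum(2):cdiv(mask:sum(2))
local loss       = per_sample:mean()
print(loss)  -- each sample contributes equally, regardless of its mask size
```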
The implementation is confusing. Because `mask_multiplier` is actually used to weight the batch loss, under stochastic gradient descent it changes the scale of the derivative, which makes it hard for an optimizer such as Adam to choose the learning rate for the next iteration.
> I got a question about this too, so to clarify:
>
> - `mask_sum` has the number of possible hands, since it's the sum over columns of `mask`.
> - `mask_multiplier = (feature_size - mask_sum) / feature_size` is then the number of impossible hands divided by the number of total hands.
> - The loss gradients, `dloss_doutput`, are divided by `mask_multiplier`, so the adjusted gradient is the original gradient multiplied by the number of total hands and divided by the number of impossible hands.
>
> Why is the gradient not divided by the number of possible hands?
The gradient doesn't need to be divided by the number of possible hands because the loss is already normalized by that number. Since the gradient is the derivative of the normalized loss, the same normalization carries through to the gradient.
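In symbols, with $n$ the normalizing count and $L$ the unnormalized loss:

$$\frac{\partial}{\partial\theta}\left(\frac{L(\theta)}{n}\right) = \frac{1}{n}\,\frac{\partial L(\theta)}{\partial\theta},$$

so dividing the loss by $n$ already divides its gradient by $n$.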
In line 57 of masked_huber_loss.lua, the comment says 1 is for impossible features, but it is actually 0 for impossible features. So line 65 should actually be `(batch_size * feature_size) / self.mask_sum:sum()`, and lines 58-60 should also be changed.
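A sketch of what that proposed change might amount to (the commenter's suggestion, not the repo author's fix; `mask_sum` is assumed to hold the per-sample counts of possible features, with mask = 1 meaning possible):

```lua
require 'torch'

-- proposed multiplier: normalize the batch loss by the number of
-- *possible* features instead of the impossible ones
local function proposed_multiplier(batch_size, feature_size, mask_sum)
  return (batch_size * feature_size) / mask_sum:sum()
end

-- hypothetical batch of 2 samples with 4 features each
local mask_sum = torch.Tensor{{3}, {2}}     -- possible hands per sample
print(proposed_multiplier(2, 4, mask_sum))  -- 8 / 5 = 1.6
```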