lifrordi / DeepStack-Leduc

Example implementation of the DeepStack algorithm for no-limit Leduc poker
https://www.deepstack.ai/

Incorrect Masked Huber Loss calculation #14

Open · bobbiesbob opened this issue 6 years ago

bobbiesbob commented 6 years ago

In line 57 of masked_huber_loss.lua, the comment says 1 is for impossible features.

It is actually 0 for impossible features.

So line 65 should actually be `(batch_size * feature_size) / self.mask_sum:sum()`.

Lines 58-60 should also be changed.

lifrordi commented 6 years ago

No, the code is correct; there are two variables, `mask` and `mask_multiplier`, that have different semantics for 0 and 1.
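
To make the two conventions concrete, here is a minimal standalone Lua sketch (plain Lua, no Torch). The shapes and values are made up, and the reading of the conventions, namely that the incoming `mask` uses 1 for possible hands while the flipped internal copy whose row sums become `mask_sum` uses 1 for impossible hands, follows the reply above rather than reproducing the repository's code.

```lua
-- Assumption: mask[i][j] == 1 means hand j is possible in sample i;
-- the flipped copy (1 - mask) marks IMPOSSIBLE hands, and its per-row
-- sum is what the code calls mask_sum.

local batch_size, feature_size = 2, 4

local mask = {
  {1, 1, 0, 0},
  {1, 1, 1, 0},
}

local mask_sum = {}          -- number of impossible hands per sample
local mask_multiplier = {}   -- fraction of possible hands per sample
for i = 1, batch_size do
  local impossible = 0
  for j = 1, feature_size do
    impossible = impossible + (1 - mask[i][j])
  end
  mask_sum[i] = impossible
  mask_multiplier[i] = (feature_size - impossible) / feature_size
end

for i = 1, batch_size do
  print(string.format("sample %d: impossible = %d, mask_multiplier = %.2f",
                      i, mask_sum[i], mask_multiplier[i]))
end
-- sample 1: impossible = 2, mask_multiplier = 0.50
-- sample 2: impossible = 1, mask_multiplier = 0.75
```

Under this reading, the line-57 comment refers to the flipped copy, which would explain why it says 1 is for impossible features.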

dmorrill10 commented 6 years ago

I got a question about this too, so to clarify:

Why is the gradient not divided by the number of possible hands?

JaysenStark commented 6 years ago

@dmorrill10 I have exactly the same question as you, did you figure it out?

dmorrill10 commented 6 years ago

@JaysenStark no, not yet.

KK666-AI commented 5 years ago

@dmorrill10 @lifrordi I think the correct loss should be loss = avg( sum(|pred_i - actual_i|) / mask_i ). That is, each sample's loss should be averaged over its own number of possible hands (mask_i), and those per-sample losses then averaged over the batch.

The implementation is confusing because mask_multiplier is actually used to weight the batch loss. With stochastic gradient descent, this mask_multiplier changes the scale of the derivative, which makes it hard for an optimizer such as Adam to choose an appropriate learning rate for the next iteration.
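
The scaling difference described here can be seen with a short standalone Lua sketch on made-up numbers. The per-sample average follows the formula proposed above; the batch-level rescaling is one plausible reading of how a single mask-derived multiplier acts on the averaged batch loss, not a verbatim transcription of the repository's code.

```lua
-- Assumptions: |pred - actual| stands in for the Huber term, err is already
-- zeroed on impossible hands, and mask[i][j] == 1 marks a possible hand.

local mask = { {1, 1, 0, 0}, {1, 1, 1, 0} }
local err  = { {0.2, 0.4, 0.0, 0.0}, {0.3, 0.3, 0.6, 0.0} }
local batch_size, feature_size = #mask, #mask[1]

-- (a) per-sample average: each sample averaged over its own possible hands,
--     then averaged over the batch
local per_sample = 0
-- (b) batch-level rescaling: average over all batch_size * feature_size entries,
--     then multiply by (total entries / possible entries)
local total_err, possible_total = 0, 0

for i = 1, batch_size do
  local row_err, row_possible = 0, 0
  for j = 1, feature_size do
    row_err = row_err + err[i][j]
    row_possible = row_possible + mask[i][j]
  end
  per_sample     = per_sample + row_err / row_possible
  total_err      = total_err + row_err
  possible_total = possible_total + row_possible
end

per_sample = per_sample / batch_size
local batch_level = (total_err / (batch_size * feature_size))
    * ((batch_size * feature_size) / possible_total)

print(per_sample)   -- (0.6/2 + 1.2/3) / 2 = 0.35
print(batch_level)  -- 1.8 / 5             = 0.36 (the two normalizations differ)
```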

KK666-AI commented 5 years ago

I got a question about this too, so to clarify:

  • mask_sum has the number of possible hands, since it's the sum over columns of mask.
  • mask_multiplier = (feature_size - mask_sum) / feature_size is then the number of impossible hands divided by the total number of hands.
  • The loss gradients, dloss_doutput, are divided by mask_multiplier, so the adjusted gradient is the original gradient multiplied by the total number of hands and divided by the number of impossible hands.

Why is the gradient not divided by the number of possible hands?

The gradient doesn't need to be divided by the number of possible hands because the loss is already normalized by that number; since the gradient is the derivative of this normalized loss, the factor is carried through automatically.
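
As a sanity check of that argument, here is a tiny standalone Lua finite-difference sketch with made-up numbers: a scalar Huber term with delta = 1, and `multiplier` standing in for whatever normalization factor is applied to the loss. Whatever factor multiplies the loss shows up in the gradient automatically, so no separate division is needed in the backward pass.

```lua
-- Huber loss with delta = 1 for a single scalar prediction
local function huber(pred, target)
  local d = pred - target
  if math.abs(d) <= 1 then
    return 0.5 * d * d
  else
    return math.abs(d) - 0.5
  end
end

local pred, target = 0.7, 0.2       -- made-up values
local multiplier = 8 / 5            -- e.g. total entries / possible entries
local eps = 1e-6

-- gradient of the UNSCALED loss (central finite difference)
local g_raw = (huber(pred + eps, target) - huber(pred - eps, target)) / (2 * eps)
-- gradient of the SCALED loss
local g_scaled = (multiplier * huber(pred + eps, target)
                  - multiplier * huber(pred - eps, target)) / (2 * eps)

print(g_raw, g_scaled, multiplier * g_raw)  -- g_scaled == multiplier * g_raw (up to fp error)
```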