Hi @Levstyle, thanks for putting this up here! You're right about the dimensions. It looks like right now the cost function is as good as not having a mask. I will fix this and report the resulting improvement.
import tensorflow as tf

def cross_entropy_with_sequence_mask(output, target):
    cross_entropy = target * tf.log(output + 1e-8)    # B x 2 x N
    cross_entropy = -tf.reduce_sum(cross_entropy, 2)  # B x 2
    mask = tf.sign(tf.reduce_max(tf.abs(target), 2))  # B x 2
    cross_entropy *= mask                             # B x 2
    cross_entropy = tf.reduce_sum(cross_entropy, 1)   # B
    cross_entropy /= tf.reduce_sum(mask, 1)           # B
    return tf.reduce_mean(cross_entropy)              # 1
I don't understand the reason for using the mask. BTW, I guess that tf.losses.softmax_cross_entropy can replace the above function as follows:
cross_entropy = tf.losses.softmax_cross_entropy(tf.reshape(target, (-1, N, ...)
But I'm not sure how the mask works. I would appreciate your explanation.
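Roughly, something like this sketch is what I have in mind (just my guess, and it assumes output held raw, pre-softmax logits of shape B x 2 x N; the flattened names below are placeholders, not variables from the repo):

logits_flat = tf.reshape(output, (-1, N))   # (B*2) x N
labels_flat = tf.reshape(target, (-1, N))   # (B*2) x N, one-hot rows
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels_flat, logits=logits_flat)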
@huangpeng1126 I actually noticed this mistake but didn't have time to commit a fix. Because we are already multiplying the logits with the label, which is essentially just a one-hot vector, we don't need masking at all. I will be removing the mask function, since we don't want to mask the cross entropy but rather the logits output by the pointer network before softmax (see here).
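For anyone reading this later, the usual trick for that is to push the padded positions to a very large negative value before the softmax; the names below are placeholders rather than the actual variables in this repo:

# span_logits: B x 2 x N raw scores; length_mask: B x 1 x N with 1.0 for real tokens, 0.0 for padding
masked_logits = span_logits + (1.0 - length_mask) * -1e30
span_probs = tf.nn.softmax(masked_logits)   # padded positions end up with ~0 probability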
For practical purposes, though, the current cross_entropy_with_sequence_mask function works perfectly fine: it sums across the passage_length dimension, multiplies by a mask that makes no difference at all, then sums again across the time dimension of the pointer network and divides by 2, which is identical to taking the average of the cross entropy. Please feel free to correct me if I'm wrong.
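A quick way to see that the mask is a no-op: target is one-hot along the last axis, so every (batch, start/end) row has a nonzero entry and tf.sign(tf.reduce_max(tf.abs(target), 2)) is all ones. A toy check with made-up numbers, mirroring those ops in NumPy:

import numpy as np

target = np.zeros((1, 2, 5), dtype=np.float32)
target[0, 0, 1] = 1.0   # one-hot start position
target[0, 1, 3] = 1.0   # one-hot end position
mask = np.sign(np.abs(target).max(axis=2))
print(mask)   # [[1. 1.]] -- all ones, so multiplying by it changes nothing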
Also, the reason I can't use tf.losses.softmax_cross_entropy is that I'm already applying softmax in the pointer network before outputting the logits. (Check here and here)
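Concretely (with placeholder names): both tf.losses.softmax_cross_entropy and tf.nn.softmax_cross_entropy_with_logits apply softmax internally, so feeding them the already-softmaxed probabilities would squash the distribution a second time; they are only valid on raw scores.

probs = tf.nn.softmax(scores)                                                    # what the pointer network currently returns
wrong = tf.nn.softmax_cross_entropy_with_logits(labels=target, logits=probs)    # softmaxes probs a second time
valid = tf.nn.softmax_cross_entropy_with_logits(labels=target, logits=scores)   # only correct on the raw scores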
Fixed as of 9997f7506d8c9bfe27ea49a817591e7de30b31a8. Thanks!
@minsangkim142
but rather the logits output by the pointer network before softmax
That is great, and it seems like a better idea.
Also, the reason I can't use tf.losses.softmax_cross_entropy is that I'm already applying softmax in the pointer network before outputting the logits.
Thanks for your explanation; you are right, softmax_cross_entropy is wrong when its inputs have already been through a softmax.
BTW, your code is very pretty, and I learned lots of coding tricks from it. When I read the code above, though, I found it hard to understand what the mask is for.
To my knowledge, the shapes of output and target are both (B, 2, N): we have B samples, and for each sample we have to predict the starting position and the ending position, where every position has N options. But the actual lengths of most samples are less than N, so we have to mask the extra positions out. As stated, the mask should work on axis 2, not axis 1.