Closed: yiqingyang2012 closed this issue 7 years ago.
Hi,
`sum(loss_t)` is the negative log-likelihood `-sum_{(x, y)} log p(y|x)`, where x is the document and y is the summary. Then we compute the average over the batch by dividing by `batch_size`.
> Here `loss_t` is a matrix of shape `[batch_size, sequence_length]`; why divide by `self.batch_size` rather than `self.batch_size * sequence_length`?
We want to optimize the probability of a summary, not the probability of generating each word. Even if we wanted to do that, it would be better to set `average_across_timesteps=True` rather than dividing by `self.batch_size * sequence_length`, because of the padding. Also, experiments show that the current setting performs better.
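To make the difference concrete, here is a minimal NumPy sketch (the loss values and 0/1 weights below are made up for illustration, not taken from the model) contrasting the per-summary normalization used in this repo with a naive division by `batch_size * max_len` and with a padding-aware per-word average, which is roughly what `average_across_timesteps=True` computes when the weights are 0/1 padding masks:

```python
import numpy as np

# Toy per-token losses, shape [batch_size, max_len]; zeros are padded positions
# already masked out by the 0/1 weights. All numbers are made up.
loss_t = np.array([[2.0, 1.5, 0.0, 0.0],   # summary with 2 real tokens
                   [1.0, 1.2, 0.8, 0.5]])  # summary with 4 real tokens
weights = np.array([[1.0, 1.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0, 1.0]])
batch_size, max_len = loss_t.shape

# Per-summary loss, as in this repo: sum everything, divide by batch_size.
per_summary = loss_t.sum() / batch_size                        # 3.5

# Naive per-word loss: dividing by batch_size * max_len counts padding too.
naive_per_word = loss_t.sum() / (batch_size * max_len)         # 0.875

# Padding-aware per-word loss: average each summary over its real tokens only,
# then average over the batch.
per_word = (loss_t.sum(axis=1) / weights.sum(axis=1)).mean()   # 1.3125

print(per_summary, naive_per_word, per_word)
```

The per-summary and padding-aware per-word values do not depend on how much padding the batch contains, whereas the naive division shrinks as `max_len` grows.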
> What does an example represent in this model, a word or a summary?
A summary.
Thank you!
Thank you for your answer. I found that your implementation differs from TensorFlow's textsum model, which averages over words, but I think your implementation is the right one.
Hi brother
I have a question about the code below, used during training in BiGRUModel.py:

```python
loss_t = tf.contrib.seq2seq.sequence_loss(
    outputs_logits, self.decoder_targets, weights,
    average_across_timesteps=False,
    average_across_batch=False)
self.loss = tf.reduce_sum(loss_t) / self.batch_size
```
Here `loss_t` is a matrix of shape `[batch_size, sequence_length]`. Why divide by `self.batch_size` rather than `self.batch_size * sequence_length`? And what does an example represent in this model, a word or a summary?
Thank you so much.