Closed: yiqingyang2012 closed this issue 7 years ago.
Hi,
`sum(loss_t)` is the negative log-likelihood `-sum_{(x, y)} log p(y|x)`, where x is the document and y is the summary. Then we compute the average over the batch by dividing by `batch_size`.
> Here `loss_t` is a matrix of shape `[batch_size, sequence_length]`; why divide by `self.batch_size` rather than `self.batch_size * sequence_length`?
We want to optimize the probability of a summary, not the probability of generating each word. Even if we wanted to do that, it would be better to set `average_across_timesteps=True` rather than dividing by `self.batch_size * sequence_length`, because of the padding. Also, experiments show that the current setting performs better.
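To make the difference concrete, here is a minimal NumPy sketch (the loss values and 0/1 weights below are made up for illustration, not taken from the model) contrasting the per-summary normalization used in this repo with a naive division by `batch_size * max_len` and with a padding-aware per-word average, which is roughly what `average_across_timesteps=True` computes when the weights are 0/1 padding masks:

```python
import numpy as np

# Toy per-token losses, shape [batch_size, max_len]; zeros are padded positions
# already masked out by the 0/1 weights. All numbers are made up.
loss_t = np.array([[2.0, 1.5, 0.0, 0.0],   # summary with 2 real tokens
                   [1.0, 1.2, 0.8, 0.5]])  # summary with 4 real tokens
weights = np.array([[1.0, 1.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0, 1.0]])
batch_size, max_len = loss_t.shape

# Per-summary loss, as in this repo: sum everything, divide by batch_size.
per_summary = loss_t.sum() / batch_size                        # 3.5

# Naive per-word loss: dividing by batch_size * max_len counts padding too.
naive_per_word = loss_t.sum() / (batch_size * max_len)         # 0.875

# Padding-aware per-word loss: average each summary over its real tokens only,
# then average over the batch.
per_word = (loss_t.sum(axis=1) / weights.sum(axis=1)).mean()   # 1.3125

print(per_summary, naive_per_word, per_word)
```

The per-summary and padding-aware per-word values do not depend on how much padding the batch contains, whereas the naive division shrinks as `max_len` grows.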
> What does an example represent in this model, a word or a summary?
A summary.
Thank you!
Thank you for your answer. I found that your implementation differs from TensorFlow's textsum model, which averages over words, but I think your implementation is the right one.
Hi brother
I have a question about the code below, used during training in BiGRUModel.py:

```python
loss_t = tf.contrib.seq2seq.sequence_loss(
    outputs_logits, self.decoder_targets, weights,
    average_across_timesteps=False,
    average_across_batch=False)
self.loss = tf.reduce_sum(loss_t) / self.batch_size
```
Here `loss_t` is a matrix of shape `[batch_size, sequence_length]`. Why divide by `self.batch_size` rather than `self.batch_size * sequence_length`? And what does an example represent in this model, a word or a summary?
Thank you so much.