nvcuong / variational-continual-learning

Implementation of the variational continual learning method
Apache License 2.0

Why is the KL loss divided by the training size? #2

Closed: luluxing3 closed this issue 5 years ago

luluxing3 commented 5 years ago

In ddm/alg/cla_models_multihead.py, line 213:
self.cost = tf.div(self._KL_term(), training_size) - self._logpred(self.x, self.y, self.task_idx)
Why does the KL term need to be divided by training_size, which is 60,000 for the permuted MNIST task?

nvcuong commented 5 years ago

Hi @luluxing3

By definition, the variational lower bound is the expected log-likelihood minus the KL term. We can equivalently divide the whole objective by the constant training_size: this does not change the optimum and helps the optimizer converge better. Note that self._logpred(...) is already averaged over the batch size, so dividing the KL term by training_size puts both terms on the same per-example scale.
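For concreteness, here is a small NumPy sketch of that argument (not from the repo; the kl and log_liks values are made up for illustration). It shows that dividing the whole negative ELBO by the training-set size N only rescales the objective by a constant, so the minimizer is unchanged, and that the KL term must be divided by N once the log-likelihood term is batch-averaged.

```python
import numpy as np

# The full negative ELBO over a dataset of N examples is:
#     -ELBO = KL(q || p) - sum_{n=1}^{N} E_q[log p(y_n | x_n, theta)]
# Dividing everything by N gives the per-example objective:
#     cost = KL(q || p) / N - (1/N) * sum_n E_q[log p(y_n | x_n, theta)]
# The second term is exactly a batch-averaged log-likelihood (as returned by
# self._logpred(...) in the repo), so the KL term must also be scaled by 1/N
# to keep both terms on the same per-example scale.

def full_negative_elbo(kl, log_liks):
    """Negative ELBO summed over the whole dataset."""
    return kl - np.sum(log_liks)

def per_example_cost(kl, log_liks):
    """The same objective divided by the training-set size N."""
    n = len(log_liks)
    return kl / n - np.mean(log_liks)

kl = 123.4                                   # illustrative KL(q || p) value
log_liks = np.random.randn(60000) - 2.0      # illustrative per-example log-likelihoods

# The two objectives differ only by the constant factor N = 60,000,
# so they share the same minimizer; the scaled version simply has
# better-conditioned gradients for the optimizer.
assert np.isclose(full_negative_elbo(kl, log_liks) / len(log_liks),
                  per_example_cost(kl, log_liks))
```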

luluxing3 commented 5 years ago

I got it. Thanks for your explanation.