tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

Model Loss struggling with two model outputs and keras.compile/keras.fit #575

Open alexv1247 opened 5 years ago

alexv1247 commented 5 years ago

tensorflow_probability 0.7.0, tensorflow 1.14.0

I am trying to build a Bayesian neural network like the following. I already know that the loss in every TFP layer has to be scaled by the total number of training examples, by adapting its kernel_divergence_fn accordingly. However, the overall loss is still not the sum of the two output losses, which leads to a decrease in training performance after 4-5 iterations. I should add that I am using the model in an active learning setting, so the number of training examples grows with every iteration (each iteration consists of 20 epochs). I account for this by feeding the number of training examples into the model as an extra input, so the scaling should be correct.

import numpy as np
from tensorflow import keras
import tensorflow_probability as tfp
import tensorflow as tf
from plot.plot_utils import plot_model_metrics
from Custom_Keras_layers.ProbSqueezeExcite import squeeze_excite_block

The model looks as follows:

inp = keras.layers.Input(shape=[self.timesteps, self.features], name='Inp')
trainSize = keras.layers.Input(shape=[self.timesteps, self.features], name='trainSize')
# reduce the trainSize input tensor to a scalar holding the number of training examples
trainNum = keras.backend.mean(trainSize)
trainNum = keras.backend.mean(trainNum)
# left side
# 1 Conv1D block
l = tfp.layers.Convolution1DReparameterization(filters=2*self.features, kernel_size=2, padding='same', activation=tf.nn.relu,
                                        kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / trainNum)(inp)
l = keras.layers.BatchNormalization()(l)
if squeeze_excite == 1:
     l = squeeze_excite_block(l)
l = keras.layers.Dropout(dropout_rate)(l)

# 1 Conv1D block
l = tfp.layers.Convolution1DReparameterization(filters=4 * self.features, kernel_size=4, padding='same', activation=tf.nn.relu,
                                        kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / trainNum, dilation_rate=2)(l)
l = keras.layers.BatchNormalization()(l)
if squeeze_excite == 1:
    l = squeeze_excite_block(l)
l = keras.layers.Dropout(dropout_rate)(l)

# 1 LSTM block
l = keras.layers.LSTM(32, recurrent_dropout=dropout_rate, dropout=dropout_rate)(l, training=True)

# left output layer
l = tfp.layers.DenseReparameterization(self.classes, activation=tf.nn.softmax, name='left',
                                kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / trainNum)(l)

# right side
# 1 Conv1D block
r = tfp.layers.Convolution1DReparameterization(filters=2 * self.features, kernel_size=2, padding='same', activation=tf.nn.relu,
                                        kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / trainNum)(inp)
r = keras.layers.BatchNormalization()(r)
if squeeze_excite == 1:
    r = squeeze_excite_block(r)
r = keras.layers.Dropout(dropout_rate)(r)

# 1 Conv1D block
r = tfp.layers.Convolution1DReparameterization(filters=4 * self.features, kernel_size=4, padding='same', activation=tf.nn.relu,
                                        kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / trainNum, dilation_rate=2)(r)
r = keras.layers.BatchNormalization()(r)
if squeeze_excite == 1:
    r = squeeze_excite_block(r)
r = keras.layers.Dropout(dropout_rate)(r)

# 1 LSTM block
r = keras.layers.LSTM(32, recurrent_dropout=dropout_rate, dropout=dropout_rate)(r, training=True)

# right output layer
r = tfp.layers.DenseReparameterization(self.classes, activation=tf.nn.softmax, name='right',
                                kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / trainNum)(r)

self.model = keras.models.Model(inputs=[inp, trainSize], outputs=[l, r])
self.model.compile(optimizer='adam', loss=self._neg_log_likelihood_bayesian(kl=sum(self.model.losses)),
                       metrics=['accuracy'])
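
A side note on the trainSize trick above: instead of feeding the training-set size through an extra Input tensor and averaging it down to a scalar, one alternative sketch (num_train and x_train are hypothetical names, not part of the code above) keeps the size in a non-trainable variable and updates it at the start of every active-learning iteration:

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras

# Non-trainable scalar that holds the current number of training examples.
num_train = tf.Variable(1.0, trainable=False, dtype=tf.float32, name='num_train')

# Scale every layer's KL term by the current training-set size.
scaled_kl = lambda q, p, _: tfp.distributions.kl_divergence(q, p) / num_train

# ... build the layers with kernel_divergence_fn=scaled_kl ...

# At the start of each active-learning iteration (x_train is the current
# training set, a hypothetical name here):
keras.backend.set_value(num_train, float(len(x_train)))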

The loss function is adapted from the TensorFlow Probability website:

def _neg_log_likelihood_bayesian(self, kl):
    def lossFunction(y_true, y_pred, kl=kl):
        neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true, logits=y_pred)
        loss = neg_log_likelihood + kl
        return loss
    return lossFunction
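
Worth checking at this point: Keras normally adds everything collected in model.losses, which is where the kernel_divergence_fn terms of the TFP layers end up, to the total training loss on its own. If that is happening here, passing kl=sum(self.model.losses) into the per-output loss as well would count the KL again for each output, which could also explain why the displayed total is not simply left_loss + right_loss. A minimal sketch of the alternative, assuming the KL terms really are collected in model.losses:

def _neg_log_likelihood(self, y_true, y_pred):
    # Negative log-likelihood only, reduced to a scalar so nothing broadcasts;
    # the KL terms registered by the TFP layers are left for Keras to add.
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true, logits=y_pred))

self.model.compile(optimizer='adam', loss=self._neg_log_likelihood,
                   metrics=['accuracy'])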

The model trains well at the beginning and reaches very good performance quickly (4-5 iterations). But at the same time the overall loss does not seem to be the exact sum of the two output losses:

675/675 [==============================] - 0s 301us/sample - loss: 64.2562 - left_loss: 32.5349 - right_loss: 32.5441 - left_acc: 0.9289 - right_acc: 0.9200

The sum of both output losses is 65.079, which is not equal to 64.2562. After 16 iterations it looks as follows:

2336/2336 [==============================] - 1s 281us/sample - loss: 1.3558 - left_loss: 1.1477 - right_loss: 1.1368 - left_acc: 0.8104 - right_acc: 0.8249

The overall loss is nowhere near the sum of both output losses here. Furthermore, performance gets much worse with more training, although both output losses keep decreasing.

My feeling is that the overall loss is essentially just the KL, while the output losses are KL + neg_log_likelihood, so the neg_log_likelihood ends up not being taken into account. The KL loss keeps decreasing, but the neg_log_likelihood increases after some time because the model is not being trained correctly?

What do I have to do to solve this problem, or is this a bug?

brianwa84 commented 5 years ago

I don't quite understand the model. If you just want twice as many outputs, why not ask for twice as many and forego the two separate stacks?
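
For reference, one way a single stack with twice as many outputs could look (a sketch only, with placeholder sizes; labels would then need shape [batch, 2, classes]):

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras

timesteps, features, classes = 10, 8, 3                    # placeholder sizes

inp = keras.layers.Input(shape=[timesteps, features])
h = keras.layers.LSTM(32)(inp)
h = tfp.layers.DenseReparameterization(2 * classes)(h)     # both heads at once
out = keras.layers.Reshape([2, classes])(h)
out = keras.layers.Softmax(axis=-1)(out)                   # softmax per head
model = keras.models.Model(inputs=inp, outputs=out)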

What is the shape of inp, l, r, and your training xs and labels?

Seems like the nll would explain the difference between the kl losses and the total? What is the nll?

Also be careful that the nll and kl are both scalars, otherwise one will be broadcast and counted twice.
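
To illustrate the broadcasting point (a sketch with hypothetical shapes): tf.nn.softmax_cross_entropy_with_logits_v2 returns one value per example, so adding a scalar KL to it broadcasts that KL across the whole batch before Keras reduces the loss.

import tensorflow as tf

nll = tf.random.uniform([32])   # per-example NLL, shape [batch_size]
kl = tf.constant(3.0)           # scalar KL summed over all layers
loss = nll + kl                 # kl is broadcast onto all 32 examples
# With a mean reduction the KL is still added once, but with a sum reduction it
# would be counted 32 times. Reducing the NLL to a scalar first removes the
# ambiguity:
loss_scalar = tf.reduce_mean(nll) + kl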


alexv1247 commented 5 years ago

Thanks for your advice, I shrank the network down. I have the same feeling that the nll is the difference, but why is it not reflected in the overall loss? Is it because the kl is not on the same scale as the nll?
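
A quick way to see the two scales side by side (a sketch; kl_metric is a hypothetical helper, not part of the model above) is to log the summed KL as an extra metric next to the output losses:

kl_total = sum(self.model.losses)

def kl_metric(y_true, y_pred):
    # Ignores the batch; just reports the current summed KL for comparison.
    return kl_total

self.model.compile(optimizer='adam',
                   loss=self._neg_log_likelihood_bayesian(kl=kl_total),
                   metrics=['accuracy', kl_metric])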

brianwa84 commented 5 years ago

Right, I think given the scale of the numbers you posted, the prior is outweighing the likelihood pretty substantially. It should be possible to optimize both, since gradients flow from both, so you might try tinkering with the learning rate (lowering it).

You could downweight the prior explicitly in the place where you are dividing by the number of data points. I'm not sure offhand what the "principled thing" to do here is.
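
For concreteness, one place such an explicit down-weighting could go (a sketch; kl_weight and num_train_examples are hypothetical names, and the value of kl_weight is a tuning knob rather than a principled constant):

import tensorflow as tf
import tensorflow_probability as tfp

kl_weight = 0.1            # < 1 downweights the prior relative to the likelihood
num_train_examples = 2336

scaled_kl = (lambda q, p, _:
             kl_weight * tfp.distributions.kl_divergence(q, p)
             / tf.cast(num_train_examples, tf.float32))

layer = tfp.layers.DenseReparameterization(
    10, activation=tf.nn.softmax, kernel_divergence_fn=scaled_kl)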


alexv1247 commented 5 years ago

Yeah, I found that if I divide the prior by a much bigger number, training behaves in a more stable and reasonable way. But as you said, there should be some kind of principled guidance.