xht033 opened 5 years ago
So the issue is that the correct usage of TFP's layers involves scaling the KL divergence (stored inside the model's losses property) by the number of examples (see here). Unfortunately, when you do model.compile the scaling is not done, so the model ends up being over-regularized.
Here's one way to resolve this: override the DenseReparameterization kernel_divergence_fn argument to perform the necessary scaling:
model.add(tfp.layers.DenseReparameterization(
    10,
    activation='softmax',
    input_shape=(784,),
    kernel_divergence_fn=lambda q, p, _: tfp.distributions.kl_divergence(q, p) / tf.to_float(train.num_examples)))
This should do the right thing at least for the train loss. Depending on how you interpret what the validation loss should be, it will do the right thing for that as well.
Can I ask why we divide by the total number of examples in the training dataset?
Hi there, thanks for that advice about how to handle the Keras compile issue. At first I thought it solved my problem, but after three training iterations, while adding new data to the training set at each iteration (I am doing active learning), the losses for both model outputs were getting bigger again. I chose to scale the kl_divergence by the batch size. Is that correct, or do I need to scale it by the total number of training examples the model will see during a training iteration?
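Not an official answer, but for what it's worth, here is a minimal sketch of how the advice above could be applied in an active-learning loop: the divisor is the current total number of training examples (not the batch size), and the model is rebuilt with the updated count each round. The helper name make_model and the loop variables (active_learning_rounds, x_train, y_train) are hypothetical, purely for illustration.

import tensorflow as tf
import tensorflow_probability as tfp

def make_model(num_train_examples):
    # Scale the KL term by the *total* number of training examples,
    # per the advice above, rather than by the batch size.
    kl_fn = (lambda q, p, _: tfp.distributions.kl_divergence(q, p)
             / tf.cast(num_train_examples, tf.float32))
    model = tf.keras.Sequential([
        tfp.layers.DenseReparameterization(
            10, activation='softmax', input_shape=(784,),
            kernel_divergence_fn=kl_fn),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

# Hypothetical active-learning loop: rebuild with the new dataset size each
# time examples are added, so the KL scaling tracks the growing training set.
# for x_train, y_train in active_learning_rounds:
#     model = make_model(len(x_train))
#     model.fit(x_train, y_train, batch_size=128, epochs=10)

Note that rebuilding each round restarts the weights; carrying them over between rounds is possible but beyond this sketch.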
@SiegeLordEx Could you please answer the question above by @xht033 and @alexv1247?
If we perform mini-batch stochastic gradient descent, shouldn't we divide the KL loss by the size of the mini-batch rather than the size of the whole training dataset? I think the answer to this question depends on when the KL loss (regularisation) is added to the final loss, which should be at each training step (so after each mini-batch).
Apparently, in the example https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/bayesian_neural_network.py, you also divide the KL term by the total number of training examples, even though you perform mini-batch stochastic gradient descent.
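A sketch of the usual reasoning (my own summary, not a maintainer reply). The negative ELBO over the whole training set of $N$ examples is

$$
-\mathrm{ELBO} \;=\; -\sum_{i=1}^{N}\mathbb{E}_{q(w)}\!\left[\log p(y_i \mid x_i, w)\right] \;+\; \mathrm{KL}\!\left(q(w)\,\|\,p(w)\right).
$$

Dividing the whole objective by $N$ gives the per-example form

$$
-\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{q(w)}\!\left[\log p(y_i \mid x_i, w)\right] \;+\; \frac{1}{N}\,\mathrm{KL}\!\left(q(w)\,\|\,p(w)\right).
$$

The mini-batch mean of the negative log-likelihood (which is what Keras computes per step) is an unbiased estimate of the first term, so adding $\mathrm{KL}/N$ to each batch loss recovers the scaled objective in expectation. Dividing by the batch size $B$ instead would add the KL once per batch, i.e. roughly $N/B$ times too much per epoch, which over-regularizes the model.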
I have a similar problem with DistributionLambda layers when using a custom metric: what gets passed as yhat to the metric function is a sample drawn from the distribution. It would be better to pass the actual distribution, as is done for custom loss functions. The workaround is to specify convert_to_tensor_fn as lambda t: t.loc.
Should I open a separate issue, or is this similar enough to the problem above with DenseReparameterization?
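For concreteness, a small sketch of that workaround on a toy Normal output head (the layer sizes and the metric are made up for illustration): with convert_to_tensor_fn=lambda d: d.loc, the tensor Keras hands to custom metrics is the distribution's location rather than a random sample, while a custom loss can still call log_prob on the distribution itself.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, input_shape=(4,)),
    tfp.layers.DistributionLambda(
        # Build a Normal from the two Dense outputs (loc and scale).
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=tf.math.softplus(t[..., 1:])),
        # Workaround: coerce to the location instead of a random sample.
        convert_to_tensor_fn=lambda d: d.loc),
])

def nll(y_true, y_pred_dist):
    # Custom losses still receive the distribution object.
    return -y_pred_dist.log_prob(y_true)

def mae_of_loc(y_true, y_pred):
    # Custom metrics receive the converted tensor, i.e. the loc here.
    return tf.reduce_mean(tf.abs(y_true - y_pred))

model.compile(optimizer='adam', loss=nll, metrics=[mae_of_loc])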
When I use my custom loss function, I get a wrong loss output if I compile with keras.compile:
However, if I replace the Bayesian layer with a traditional Dense layer, everything seems correct.