Description
This PR fixes the issues we were having with the mixture-based activation functions for the LSTM layer.
Changes Made
The change is simple: clamp the variance with `max(1E-6, var_a)` in the `mixture_sigmoid` and `mixture_tanh` activation functions.
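As a minimal sketch of the fix, the snippet below shows where the clamp sits in a mixture-based sigmoid that propagates a Gaussian through the non-linearity. The moment equations here are simplified placeholders, not the library's actual formulas; only the final `max(1E-6, var_a)` reflects the change in this PR.

```python
import numpy as np

VAR_MIN = 1e-6  # lower bound imposed on the output variance by this PR

def mixture_sigmoid(mu_z, var_z):
    """Hypothetical sketch: propagate a Gaussian (mu_z, var_z) through a
    sigmoid. The real mixture moments are more involved; the point here
    is the variance clamp at the end."""
    mu_a = 1.0 / (1.0 + np.exp(-mu_z))           # placeholder output mean
    var_a = var_z * (mu_a * (1.0 - mu_a)) ** 2   # placeholder output variance
    # The fix: never let the variance collapse to zero (or go slightly
    # negative from floating-point error), which later yields NaN in the
    # LSTM gates.
    var_a = np.maximum(VAR_MIN, var_a)
    return mu_a, var_a
```

With `var_z = 0` the unclamped variance would be exactly zero; after the clamp it is `1E-6`, so downstream computations stay finite.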
Note for Reviewers
I diagnosed the issue by first benchmarking `ReLU` vs. `MixtureReLU`, `Sigmoid` vs. `MixtureSigmoid`, and `Tanh` vs. `MixtureTanh` on the MNIST dataset using FNN and CNN architectures. The bottom line is that `ReLU` outperforms `MixtureReLU`, but `MixtureSigmoid` and `MixtureTanh` outperform their locally linearized counterparts. There were no numerical errors in any run of the classification setup.
When running on time series with the LSTM architecture, I realized that the `NaN` values occurred only when both the candidate and input gates used mixture-based activation functions. Imposing a minimum value of 1E-6 on the variance of the mixture sigmoid and tanh gates fixes the issue without affecting performance on either the LSTM or the classification tasks.
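To illustrate one plausible failure mode (an assumption on my part, not a trace of the actual code path): when the candidate and input gate outputs are combined elementwise, the product variance involves the two gate variances, and a numerically zero or slightly negative variance can turn a later square root into `NaN`, which then propagates through the cell state.

```python
import numpy as np

def gate_product_variance(mu_i, var_i, mu_c, var_c):
    """Variance of the elementwise product i * c of two independent
    Gaussian gate outputs (standard product-of-Gaussians moments).
    A hypothetical illustration, not the library's actual code."""
    return var_i * var_c + var_i * mu_c**2 + var_c * mu_i**2

# An unclamped variance that has drifted slightly negative from
# floating-point cancellation makes the product variance negative,
# and a downstream sqrt then produces NaN:
var_bad = gate_product_variance(0.5, -1e-12, 0.5, 0.0)

# Clamping both gate variances at 1E-6 keeps every downstream term
# strictly positive and finite:
var_ok = gate_product_variance(0.5, max(1e-6, -1e-12), 0.5, max(1e-6, 0.0))
```

This matches the observation that the `NaN` only appears when both gates use mixture-based activations: with at least one locally linearized gate, the offending near-zero variance never meets a second one in the product term.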
I did not change the default activation functions for the LSTM layer in this PR, as we will first run an extensive benchmark to decide whether the locally linearized or the mixture-based activations perform better. This fix should enable us to move forward with the remax for the attention mechanism.