lhnguyen102 / cuTAGI

CUDA implementation of Tractable Approximate Gaussian Inference

Mixture-based activation fix - Second PR #69

Closed. jamesgoulet closed this 1 month ago

jamesgoulet commented 1 month ago

Description

This PR fixes the issues we were having with the mixture-based activation functions for the LSTM layer.

Changes Made

The change is simple: take max(1E-6, var_a) for the variance in the mixture_sigmoid and mixture_tanh activation functions.
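
Below is a minimal sketch of where the clamp sits, assuming a CPU-side loop that writes the mixture moments into var_a; the function and variable names are illustrative, not the actual cuTAGI signatures.

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: names and signatures are assumptions, not the cuTAGI API.
void mixture_sigmoid_var_clamp_sketch(std::vector<float> &var_a) {
    constexpr float min_var = 1e-6f;
    for (auto &v : var_a) {
        // ... mixture mean/variance of the activation computed above ...
        // Floor the variance so it cannot collapse to zero (or go slightly
        // negative from round-off) and poison downstream updates.
        v = std::max(min_var, v);
    }
}
```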

Note for Reviewers

I diagnosed the issue by first benchmarking ReLU vs. MixtureReLU, Sigmoid vs. MixtureSigmoid, and Tanh vs. MixtureTanh on the MNIST dataset using FNN and CNN architectures. The bottom line is that ReLU outperforms MixtureReLU, but MixtureSigmoid and MixtureTanh outperform their locally linearized counterparts. In no case were there any numerical errors while running the classification setup.

When running on time series with the LSTM architecture, I realized that the NaN values occurred only when both the candidate and input gates were using mixture-based activation functions. By imposing a minimum value of 1E-6 on the variance of the mixture sigmoid and tanh gates, we fix the issue without affecting performance on either the LSTM or the classification tasks.
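
For intuition, the following illustration (a hedged example, not the actual failure path traced in cuTAGI) shows how a variance that collapses to zero can turn a later normalization into inf and then NaN, and how a 1E-6 floor keeps the value finite.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    float var_a = 0.0f;                  // collapsed activation variance
    float z = 1.0f / std::sqrt(var_a);   // division by zero -> inf
    float gate = 0.0f * z;               // 0 * inf -> NaN, which propagates
    std::printf("%f %f\n", z, gate);     // prints: inf nan

    float clamped = std::fmax(1e-6f, var_a);          // the 1E-6 floor
    std::printf("%f\n", 1.0f / std::sqrt(clamped));   // finite: 1000.0
    return 0;
}
```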

I did not change the activation functions for the LSTM layer in this PR, as we will first run an extensive benchmark to decide whether the locally linearized or the mixture-based versions perform better. This fix should enable us to move forward with the remax for the attention mechanism.

Note: In the first PR, I forgot to also modify the CUDA code. I have tested the new changes to both mixture_sigmoid_mean_var_cuda and mixture_tanh_mean_var_cuda on the MNIST classification and the LSTM time series forecasting examples.
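
On the GPU side, the same floor can be applied per element inside the kernel. The sketch below only illustrates the clamp; the kernel name, arguments, and indexing are assumptions rather than the real mixture_sigmoid_mean_var_cuda / mixture_tanh_mean_var_cuda code, which also computes the mixture moments.

```cuda
// Sketch only: kernel name and arguments are assumptions.
__global__ void clamp_var_a_cuda(float *var_a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Same floor as the CPU path: keep var_a strictly positive.
        var_a[i] = fmaxf(1e-6f, var_a[i]);
    }
}
```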

lhnguyen102 commented 1 month ago

@jamesgoulet this is a good insight. Thanks for looking into it