In the paper, you mention that your implementation of XdG gates 80% of hidden units, selected at random. How much tuning did it take to arrive at that hyperparameter setting? Further, how do you think changing this percentage would affect learning, forgetting, and performance as the number of tasks increases?
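For concreteness, here is a minimal sketch of the kind of gating I have in mind. It is my own reading, not the paper's code: the function names, the `gated_fraction` parameter, the layer sizes, and the ReLU layer are all assumptions made for illustration.

```python
import numpy as np

def make_task_masks(n_tasks, n_hidden, gated_fraction=0.8, seed=0):
    """Return one fixed binary mask per task; 0 marks a gated (silenced) unit."""
    rng = np.random.default_rng(seed)
    n_gated = int(round(gated_fraction * n_hidden))
    masks = np.ones((n_tasks, n_hidden), dtype=np.float32)
    for t in range(n_tasks):
        # Pick a different random subset of units to silence for each task.
        gated = rng.choice(n_hidden, size=n_gated, replace=False)
        masks[t, gated] = 0.0
    return masks

def gated_hidden(x, W, b, mask):
    """ReLU hidden layer whose output is multiplied by the task's mask."""
    return np.maximum(x @ W + b, 0.0) * mask

# Hypothetical sizes: 10 tasks, 100 hidden units, 80% gated per task.
masks = make_task_masks(n_tasks=10, n_hidden=100, gated_fraction=0.8)
```

Under this sketch, any two tasks share on average (1 - 0.8)^2 = 4% of the hidden units, which is the quantity I would expect to trade off against per-task capacity as the percentage changes.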
Thanks for presenting!