Maybe the last is superfluous, but I'd like to see whether Adam or Adagrad (with reset) will help learning go faster and reach better results than simply using Nesterov momentum.
First tests show no significant difference in how quickly the macrobatch training error decreases, but differences may show up in longer runs over multiple epochs.
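For concreteness, here is a minimal sketch (in PyTorch, with a placeholder model and toy stand-in data, both hypothetical) of the kind of comparison described above. The "reset" for Adagrad is interpreted here as simply rebuilding the optimizer each epoch so its squared-gradient accumulators start from zero again; the actual reset variant used in these experiments may differ.

```python
import torch
import torch.nn as nn

def make_model():
    # Placeholder network; the real model in these experiments is not shown here.
    return nn.Linear(128, 10)

def make_optimizer(name, model):
    if name == "nesterov":
        return torch.optim.SGD(model.parameters(), lr=0.01,
                               momentum=0.9, nesterov=True)
    if name == "adam":
        return torch.optim.Adam(model.parameters(), lr=1e-3)
    if name == "adagrad":
        return torch.optim.Adagrad(model.parameters(), lr=0.01)
    raise ValueError(name)

loss_fn = nn.CrossEntropyLoss()

def train_epoch(model, opt, batches):
    # One pass over the macrobatch; returns the mean training loss.
    total = 0.0
    for x, y in batches:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        total += loss.item()
    return total / len(batches)

# Toy stand-in for the macrobatch data; shapes and sizes are arbitrary.
macrobatches = [(torch.randn(32, 128), torch.randint(0, 10, (32,)))
                for _ in range(10)]

for name in ("nesterov", "adam", "adagrad"):
    torch.manual_seed(0)                      # same initialization for a fair comparison
    model = make_model()
    opt = make_optimizer(name, model)
    for epoch in range(3):
        err = train_epoch(model, opt, macrobatches)
        print(f"{name} epoch {epoch}: mean training loss {err:.4f}")
        if name == "adagrad":
            # "Adagrad with reset": rebuilding the optimizer zeroes its
            # accumulators (one plausible reading of the reset).
            opt = make_optimizer("adagrad", model)
```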