Thank you for presenting such inspiring research!
While you are inspired by neuroscientific studies of the brain, I am wondering how the performance of the model can be analyzed in the context of online learning theory. If we treat the training and testing process for each permutation as one "period" of online learning, and take the goal of learning all the different tasks to be minimizing the average loss over the whole learning process, then a traditional ANN looks like a simple Follow the Leader rule, while the context signal acts as a regularization term, which makes your architecture behave like a Follow the Regularized Leader rule. The stability of Follow the Regularized Leader is probably why your architecture works better.
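For concreteness, the standard formulations I have in mind (with $\ell_s$ denoting the loss observed in period $s$ and $R$ a regularizer) are:

$$w_{t+1} = \arg\min_{w} \sum_{s=1}^{t} \ell_s(w) \quad \text{(FTL)}, \qquad w_{t+1} = \arg\min_{w} \left[ \sum_{s=1}^{t} \ell_s(w) + R(w) \right] \quad \text{(FTRL)}.$$

The analogy is loose, of course: my suggestion is only that the context signal plays a role similar to $R(w)$, keeping consecutive solutions from drifting too far apart, which is exactly the stability property that separates FTRL from plain FTL.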