probml / rebayes

Recursive Bayesian Estimation (Sequential / Online Inference)
MIT License

ELLA #53

Open murphyk opened 1 year ago

murphyk commented 1 year ago

Z. Deng, F. Zhou, and J. Zhu, “Accelerated Linearized Laplace Approximation for Bayesian Deep Learning,” in Advances in Neural Information Processing Systems, Oct. 2022 [Online]. Available: https://openreview.net/forum?id=jftNpltMgz. [Accessed: Apr. 29, 2023] https://github.com/thudzj/ELLA

That is, they do the same thing as our FCEKF. The Lofi version has two additional elements: (1) W acts as a set of pseudo-observations, compressing the observations in J; (2) Upsilon rescales the kernel, modifying the base NTK.
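As a sketch of that pseudo-observation structure, here is a minimal numpy example of a diagonal-plus-low-rank posterior precision, Upsilon + W W'. The sizes and the exact diagonal-plus-low-rank form are my assumptions for illustration, not taken from the Lofi code:

```python
import numpy as np

rng = np.random.default_rng(0)
P, L = 10, 3  # hypothetical sizes: P parameters, rank L << P

# Diagonal component Upsilon (positive entries) plus low-rank W W',
# where the L columns of W act as pseudo-observations.
Upsilon = np.abs(rng.normal(size=P)) + 0.1   # diagonal entries, kept positive
W = rng.normal(size=(P, L))                  # P x L factor

# The dense P x P precision (never formed explicitly in practice):
precision = np.diag(Upsilon) + W @ W.T

# Matrix-vector products cost O(PL) instead of O(P^2):
v = rng.normal(size=P)
fast = Upsilon * v + W @ (W.T @ v)
assert np.allclose(fast, precision @ v)
```

The point of the structure is the last two lines: all downstream linear algebra can be routed through the factors instead of the dense matrix.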

They use the GP formulation to motivate a low-rank approximation that is complementary to ours: it compresses along the model parameters (P) rather than along the training set (NC). In other words, the true posterior precision equals J' J, where J is the NCxP matrix that stacks the Jacobians of all training examples (for simplicity I am absorbing the outcome covariance, our R or their Lambda, into J, as I have done elsewhere). Lofi approximates J' J by W W', where W is PxL with small L. ELLA instead starts from the GP formulation, which uses J J', and approximates it as Phi' Phi, where Phi is KxNC with small K.
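A small numpy sketch of the two complementary factorizations. The sizes are made up, and both low-rank factors are built here from the SVD of J purely to illustrate the shapes; neither method actually forms them this way:

```python
import numpy as np

rng = np.random.default_rng(1)
NC, P, L, K = 12, 6, 3, 3   # hypothetical sizes

J = rng.normal(size=(NC, P))  # stacked (covariance-scaled) Jacobians

prec = J.T @ J                # P x P   weight-space posterior precision
kern = J @ J.T                # NC x NC function-space (NTK) Gram matrix

UJ, sJ, VJt = np.linalg.svd(J, full_matrices=False)  # J = UJ diag(sJ) VJt

# Lofi-style factor: W is P x L, so W W' approximates J' J.
W = VJt[:L].T * sJ[:L]
assert W.shape == (P, L)

# ELLA-style factor: Phi is K x NC, so Phi' Phi approximates J J'.
Phi = (UJ[:, :K] * sJ[:K]).T
assert Phi.shape == (K, NC)

# Both truncations leave a spectral-norm error equal to the first
# discarded squared singular value of J:
assert np.allclose(np.linalg.norm(prec - W @ W.T, 2), sJ[L] ** 2)
assert np.allclose(np.linalg.norm(kern - Phi.T @ Phi, 2), sJ[K] ** 2)
```

The symmetry of the two error expressions is the point: the two methods truncate the same spectrum, just from opposite sides of the Gram-matrix duality.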

Their rank reduction is also based on SVD, applied to the kernel matrix of a random subsample of the training data (of size M, with K ≤ M < NC). Because the SVDs of J J' and J' J are equivalent (they retain the same information), their method should be close (possibly identical, apart from our SSM assumptions) to a hypothetical version of Lofi based on M random past observations: imagine we could compute the exact posterior precision from those M observations and then take its rank-K SVD. This analysis suggests we should outperform them, since instead of a random subset of the data we use all of it, updated online and filtered through our SSM.
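The subsample-then-SVD idea can be sketched as a Nystrom-type construction. This is my reading of their scheme with illustrative sizes and feature formulas, not their actual code; the final assertion is the Eckart-Young fact that a rank-K approximation built from a random subsample can never beat the best rank-K SVD of the full kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
NC, P = 40, 15
M, K = 10, 5   # M randomly chosen examples, K <= M retained directions

J = rng.normal(size=(NC, P))  # stacked (covariance-scaled) Jacobians

# Eigendecompose the M x M kernel of a random subsample, keep top K.
idx = rng.choice(NC, size=M, replace=False)
Jm = J[idx]                                  # M x P
evals, evecs = np.linalg.eigh(Jm @ Jm.T)     # ascending eigenvalues
top = np.argsort(evals)[::-1][:K]

# Nystrom features: phi(x) = diag(lam^-1/2) E' k_m(x), stacked over all NC.
scale = 1.0 / np.sqrt(evals[top])
Phi = (scale[:, None] * evecs[:, top].T) @ (Jm @ J.T)   # K x NC
approx_kern = Phi.T @ Phi                    # rank-K approx of J J'

# Best possible rank-K approximation, from the SVD of the full data:
U, s, _ = np.linalg.svd(J, full_matrices=False)
best = (U[:, :K] * s[:K] ** 2) @ U[:, :K].T

err_nystrom = np.linalg.norm(J @ J.T - approx_kern)
err_best = np.linalg.norm(J @ J.T - best)
assert err_best <= err_nystrom + 1e-9  # subsampling can only do worse
```

The gap between `err_best` and `err_nystrom` is exactly the price of using M random examples instead of all NC, which is the comparison the paragraph above is making.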