ae-foster opened this issue 6 years ago (status: Open)
cc @fehiepsi
The ICML paper on approximate techniques for GPs at large scale that was discussed today (http://proceedings.mlr.press/v80/pleiss18a/pleiss18a.pdf) could offer more research-frontier techniques for addressing these kinds of issues efficiently. Also, @fritzo said today that there is a plan to merge pyro's GP support with other GP libraries.
Basically, I agree with what @ae-foster suggested.
About "Caching", if we want to sample repeatedly from posterior, then it is efficient to cache repeated computation.
About "Schur complements", the explanation makes sense (though I am a bit worried about the precision issues when working with inversed matrix).
About "Not conditioning on every point", it might be a nice feature. To prevent the instability, we have to add jitter
but if we set jitter
too big, our inference might be wrong. So if we set some tolerant
for X
to not condition on nearby inputs, things would be better.
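A minimal sketch of that tolerance idea (the function name `should_condition` and the default `tol` are purely illustrative, not anything in the library) might look like:

```python
import torch

def should_condition(X, x_new, tol=1e-4):
    """Only condition the GP on x_new if it is not within `tol` (Euclidean
    distance) of an input we have already conditioned on.

    X: (N, D) tensor of existing inputs, x_new: (D,) candidate input.
    """
    if X.size(0) == 0:
        return True
    dists = torch.norm(X - x_new, dim=1)  # broadcasts x_new over the rows of X
    return bool(dists.min() > tol)
```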
As for next steps, I will be busy for about two months (family reunion), so I can't take part in the implementation (if any) for a while.
In the current pyro GP library, GP objects like `GPRegression` recompute the full kernel matrix each time they are optimized: https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/models/gpr.py#L80
Furthermore, when sampling from the GP posterior, we recompute the kernel matrix again https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/models/gpr.py#L128 and the cross-kernel matrix with the new data points https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/util.py#L216
In each case, the kernel matrix is then decomposed by calling `potrf` (the LAPACK-style name for the Cholesky decomposition), so the inverse need not be computed explicitly. This step can fail when the kernel matrix is ill-conditioned (leaving the user with a message about the leading minor of order n not being positive definite).
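As a sketch of the usual workaround (not what the library currently does), one can add increasing jitter to the diagonal and retry the factorisation; `safe_cholesky` and its default values are hypothetical, and the modern `torch.linalg.cholesky` API is used in place of `potrf`:

```python
import torch

def safe_cholesky(K, jitter=1e-6, max_tries=5):
    """Cholesky-factorise K, adding increasing diagonal jitter if the matrix
    is not numerically positive definite."""
    eye = torch.eye(K.size(0), dtype=K.dtype, device=K.device)
    for i in range(max_tries):
        try:
            return torch.linalg.cholesky(K + (jitter * 10 ** i) * eye)
        except RuntimeError:
            continue
    raise RuntimeError("kernel matrix is not positive definite, even with jitter")
```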
Caching
It's clear we can save some time by caching appropriately. This will be particularly useful in the case of repeated, independent sampling from the GP posterior. Here we currently recompute the kernel matrix and its Cholesky decomposition each time.
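A minimal caching sketch, assuming a `kernel` callable that returns the (N, N) kernel matrix of `X` (the class and method names below are made up for illustration, not part of the current API):

```python
import torch

class CachedGPPosterior:
    """Cache the Cholesky factor of the kernel matrix so that repeated,
    independent posterior sampling does not refactorise it every time."""

    def __init__(self, kernel, X, jitter=1e-6):
        self.kernel, self.X, self.jitter = kernel, X, jitter
        self._L = None

    def invalidate(self):
        # Call after an optimizer step changes the kernel hyperparameters.
        self._L = None

    def chol(self):
        if self._L is None:
            K = self.kernel(self.X)
            K = K + self.jitter * torch.eye(K.size(0), dtype=K.dtype)
            self._L = torch.linalg.cholesky(K)
        return self._L
```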
Schur complements
In the case where we iteratively condition the GP on more and more data, or equivalently, take repeated dependent samples from the GP posterior, an alternative is to use Schur complements. This involves computing the full inverse of the original kernel matrix once; at later steps we need only invert a matrix the size of the new points (the Schur complement), not the whole thing. For more details see https://en.wikipedia.org/wiki/Schur_complement. Sampling when the precision matrix (the inverse of the kernel matrix) is known is very easy.
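A sketch of the block-inverse update (the function name and argument names are illustrative): given the inverse `A_inv` of the current kernel matrix, the cross-kernel `B` with the m new points, and the kernel matrix `C` of the new points, only an m x m matrix ever gets inverted.

```python
import torch

def extend_precision(A_inv, B, C):
    """Extend the precision (inverse kernel) matrix when new points are added.

    A_inv : (n, n) inverse of the kernel matrix of the existing points
    B     : (n, m) cross-kernel between existing and new points
    C     : (m, m) kernel matrix of the new points

    Returns the (n+m, n+m) inverse of the full kernel matrix, inverting only
    the (m, m) Schur complement.
    """
    S = C - B.t() @ A_inv @ B                # Schur complement of A
    S_inv = torch.inverse(S)                 # only an m x m inverse
    top_left = A_inv + A_inv @ B @ S_inv @ B.t() @ A_inv
    top_right = -A_inv @ B @ S_inv
    return torch.cat([
        torch.cat([top_left, top_right], dim=1),
        torch.cat([top_right.t(), S_inv], dim=1),
    ], dim=0)
```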
This was relevant to https://github.com/uber/pyro/pull/1250. We want to take repeated dependent samples from the GP posterior but don't know up front which locations we want to sample at.
Not conditioning on every point
If the same point, or one very close to an existing point, is added again to the GP, this causes instability because the kernel matrix becomes almost singular. It would be nice to have a more principled approach to deciding when not to condition on new data. My current implementation simply prevents the log-determinant from going below a threshold; the downside is that the order in which data are added now matters.
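A sketch of that log-determinant check, again assuming a `kernel` callable and using illustrative names and thresholds:

```python
import torch

def accept_new_point(kernel, X, x_new, logdet_floor=-20.0, jitter=1e-6):
    """Condition on x_new only if the log-determinant of the extended kernel
    matrix stays above `logdet_floor`, i.e. the matrix stays far enough from
    singular."""
    X_ext = torch.cat([X, x_new.unsqueeze(0)], dim=0)
    K = kernel(X_ext) + jitter * torch.eye(X_ext.size(0), dtype=X_ext.dtype)
    return bool(torch.logdet(K) > logdet_floor)
```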
This was relevant to https://github.com/uber/pyro/pull/1250. Samples were being drawn at `x_n`, where `x_n` was a convergent sequence.
Next steps
Decide which, if any, of the above suggestions deserve attention and implementation