pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0

Efficient posterior update and posterior sampling in Gaussian processes #1258

Open ae-foster opened 6 years ago

ae-foster commented 6 years ago

In the current pyro GP library, GP objects like GPRegression recompute the full kernel matrix each time they are optimized: https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/models/gpr.py#L80

Furthermore, when sampling from the GP posterior, we recompute the kernel matrix again https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/models/gpr.py#L128 as well as the cross-kernel matrix with the new data points https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/util.py#L216

In each case, the kernel matrix is then decomposed by calling potrf (a Cholesky decomposition; the name is inherited from LAPACK's xPOTRF routine), so the inverse need not be computed explicitly. This step can fail when the kernel matrix is ill-conditioned, leaving the user with a message about the leading minor of order n not being positive definite.
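For reference, a minimal standalone illustration of that failure mode and the usual jitter workaround (this uses the newer torch.linalg names rather than potrf, and double precision just to keep the toy example stable; it is not Pyro code):

```python
import torch

# Illustration only: Cholesky on a rank-deficient matrix fails with the
# "leading minor of order n is not positive definite" error; adding a small
# jitter to the diagonal is the standard workaround.
A = torch.randn(5, 3, dtype=torch.float64)
K = A @ A.t()                    # rank-3, so only positive semi-definite
try:
    L = torch.linalg.cholesky(K)
except RuntimeError:             # torch.linalg.LinAlgError subclasses RuntimeError
    L = torch.linalg.cholesky(K + 1e-6 * torch.eye(5, dtype=torch.float64))
```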

Caching

It's clear we can save some time by caching appropriately. This will be particularly useful in the case of repeated, independent sampling from the GP posterior. Here we currently recompute the kernel matrix and its Cholesky decomposition each time.
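A rough sketch of what the caching could look like (the class name, signature, and zero-mean assumption are illustrative only, not a proposed API; `kernel` is any callable `kernel(X, Z)` returning a kernel matrix):

```python
import torch

class CachedGPPosterior:
    """Sketch: cache the Cholesky factor of the training kernel matrix so
    repeated, independent posterior sampling does not refactorize it."""

    def __init__(self, kernel, X, y, noise=1e-2, jitter=1e-6):
        self.kernel, self.X = kernel, X
        K = kernel(X, X) + (noise + jitter) * torch.eye(X.size(0))
        self.L = torch.linalg.cholesky(K)                 # computed once, reused
        # alpha = K^{-1} y, also reusable across samples
        self.alpha = torch.cholesky_solve(y.unsqueeze(-1), self.L)

    def sample(self, Xnew, jitter=1e-6):
        Kfs = self.kernel(self.X, Xnew)                   # cross-kernel matrix
        Kss = self.kernel(Xnew, Xnew) + jitter * torch.eye(Xnew.size(0))
        mean = (Kfs.t() @ self.alpha).squeeze(-1)
        V = torch.cholesky_solve(Kfs, self.L)             # K^{-1} Kfs
        cov = Kss - Kfs.t() @ V
        L_s = torch.linalg.cholesky(cov)
        return mean + L_s @ torch.randn(Xnew.size(0))
```

With something like this, repeated independent samples at different `Xnew` reuse the cached factor and `alpha`, and only the (usually much smaller) predictive covariance is factorized per call.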

Schur complements

In the case where we iteratively condition the GP on more and more data, or equivalently, take repeated dependent samples from the GP posterior, an alternative is to use Schur complements. This involves taking the full inverse of the original kernel matrix. However, at later steps we need only invert a matrix the size of the new points (the Schur complement), not the whole thing. For more details see https://en.wikipedia.org/wiki/Schur_complement. Sampling when the precision matrix (inverse of kernel matrix) is known is very easy.
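Concretely, a sketch of the update (function name and signature are mine, purely to illustrate; only the Schur complement, whose size equals the number of new points, gets inverted):

```python
import torch

def extend_precision(K_inv, B, C, jitter=1e-6):
    """Sketch of the Schur-complement update. K_inv is the precision
    (inverse kernel) of the points already conditioned on, B the cross-kernel
    (old x new), C the kernel of the new points. Returns the precision of the
    joint kernel [[A, B], [B^T, C]] using S = C - B^T A^{-1} B."""
    AinvB = K_inv @ B                                     # A^{-1} B
    S = C - B.t() @ AinvB + jitter * torch.eye(C.size(0))
    S_inv = torch.inverse(S)                              # small inverse only
    top_left = K_inv + AinvB @ S_inv @ AinvB.t()
    top_right = -AinvB @ S_inv
    return torch.cat([
        torch.cat([top_left, top_right], dim=1),
        torch.cat([top_right.t(), S_inv], dim=1),
    ], dim=0)
```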

This was relevant to https://github.com/uber/pyro/pull/1250 . We want to take repeated dependent samples from the GP posterior but don't know up front which locations we want to sample at.

Not conditioning on every point

If the same point, or one very close to an existing point, is added again to the GP, this causes instability because the kernel matrix becomes almost singular. It would be nice to have a more principled approach to decide when not to condition on new data. My current implementation simply prevents the log-determinant from going below a threshold; the downside is that the order in which data is added now matters.
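As an illustration of the threshold idea (a sketch only; the names and the threshold value are placeholders, not my actual implementation): adding a point multiplies det(K) by its Schur complement s, i.e. the prior predictive variance at the new point, so thresholding log(s) per point is one way to keep the log-determinant from collapsing.

```python
import torch

def should_condition(L, k_star, k_star_star, log_det_threshold=-20.0):
    """Sketch: L is the cached Cholesky factor of the current kernel matrix,
    k_star = k(X, x_new), k_star_star = k(x_new, x_new). Adding x_new changes
    the kernel log-determinant by log(s) with
    s = k(x_new, x_new) - k(X, x_new)^T K^{-1} k(X, x_new);
    if log(s) is below the threshold, the augmented matrix is close to
    singular, so skip conditioning on the point."""
    v = torch.cholesky_solve(k_star.unsqueeze(-1), L).squeeze(-1)
    s = k_star_star - k_star @ v
    return torch.log(s.clamp(min=1e-12)) > log_det_threshold
```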

This was relevant to https://github.com/uber/pyro/pull/1250 . Samples were being drawn at x_n where x_n was a convergent sequence.

Next steps

Decide which, if any, of the above suggestions deserve attention and implementation

eb8680 commented 6 years ago

cc @fehiepsi

ae-foster commented 6 years ago

The ICML paper discussed today on approximate techniques for large-scale GPs (http://proceedings.mlr.press/v80/pleiss18a/pleiss18a.pdf) could offer more research-frontier techniques for addressing these kinds of issues efficiently. Also, @fritzo said today that there is some plan to merge pyro's GP support with other GP libraries.

fehiepsi commented 6 years ago

Basically, I agree with what @ae-foster suggested.

About "Caching", if we want to sample repeatedly from posterior, then it is efficient to cache repeated computation.

About "Schur complements", the explanation makes sense (though I am a bit worried about the precision issues when working with inversed matrix).

About "Not conditioning on every point", it might be a nice feature. To prevent the instability, we have to add jitterbut if we set jitter too big, our inference might be wrong. So if we set some tolerant for X to not condition on nearby inputs, things would be better.

On the next steps: I will be busy for about two months (family reunion), so I can't help with the implementation (if any) for a while.