pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0

Batching Gaussian Processes #1679

Open mileslucas opened 5 years ago

mileslucas commented 5 years ago

To begin I tried logging in with GitHub and also creating an account on the pyro forums, but neither of those is working.

Problem

I need to fit a batch of four independent Gaussian processes, and I don't want to use for loops to fit each one. The current GPs broadcast properly to my outputs, but I can't batch them so that the inputs are treated independently.

My input data is a tensor of shape (264, 3), so each input has 3 feature dimensions. My observed data has shape (4, 264), where 4 is my batch dimension and corresponds to the four independent Gaussian processes I'm trying to train.

I'm using an RBF kernel, and the covariance I want looks like

k(X, Z) = \sigma^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{3}\frac{(X_d - Z_d)^2}{\ell_d^2}\right)

where X (and Z) have shape (Nsamples, 3).

This ARD kernel is easy enough to implement like

amplitude = torch.tensor(20.)
lengthscale = torch.tensor([300., 30., 30.])
kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)

My first problem

In my derivation, I use priors on my variance and lengthscales such that each lengthscale in the kernel has its own Gamma prior and the variance has a uniform prior

\ell_d \sim \mathrm{Gamma}(\alpha_d, \beta_d), \qquad \sigma^2 \sim \mathrm{Uniform}(10, 150)

Unfortunately I can't set a multidimensional prior on the kernel lengthscale, regardless of kernel dimensions-

import torch
import tqdm
import pyro.contrib.gp as gp
import pyro.distributions as dist
import pyro.infer

# Dummy data
X = torch.ones((264, 3))
y = torch.ones((4, 264)) 

amplitude = torch.tensor(20.)
lengthscale = torch.tensor([300., 30., 30.])
priors = torch.tensor([
    [2, 2, 2], 
    [0.0075, 0.075, 0.075]
])
kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)
kernel.set_prior("lengthscale", dist.Gamma(priors[0], priors[1]))
kernel.set_prior("variance", dist.Uniform(10, 150))

gpr = gp.models.GPRegression(X, y, kernel, noise=torch.tensor(1.))

# Optimize
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = int(1e4)
for i in tqdm.trange(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

yields ValueError: Model and guide event_dims disagree at site 'GPR/RBF/lengthscale': 0 vs 1

The same code above works if I leave out the priors.

Batching Problems

Beyond the priors, I would really like to batch my GPs to take advantage of tensor math (or even parallelization) instead of looping over each batch dimension. I tried a similar approach in TensorFlow Probability; I struggled with setting up the ARD kernel and with TF in general, but its batching scheme worked really well.

The approach I would like to use is something like this

import numpy as np

# Dummy data
X = torch.ones((264, 3))
y = torch.ones((4, 264)) 

amplitude = 20 * torch.ones(4)
lengthscale = torch.tensor(np.tile([300., 30., 30.], (4, 1)))
kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)

gpr = gp.models.GPRegression(X, y, kernel, noise=torch.ones(4))

optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = int(1e4)
for i in tqdm.trange(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

This fails with RuntimeError: The size of tensor a (264) must match the size of tensor b (4) at non-singleton dimension 0

Moreover, I'd love to bring priors into this to have a final working product like

# Dummy data
X = torch.ones((264, 3))
y = torch.ones((4, 264)) 

amplitude = 20 * torch.ones(4)
lengthscale = torch.tensor(np.tile([300., 30., 30.], (4, 1)))
kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)
gpriors = torch.tensor([
    [2, 2, 2], 
    [0.0075, 0.075, 0.075]
])
kernel.set_prior("lengthscale", dist.Gamma(gpriors[0], gpriors[1]))
kernel.set_prior("variance", dist.Uniform(10, 150))

gpr = gp.models.GPRegression(X, y, kernel, noise=torch.ones(4))

optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = int(1e4)
for i in tqdm.trange(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
martinjankowiak commented 5 years ago

i suggest using a combination of pyro and gpytorch instead of pure pyro.contrib.gp. importantly for you, gpytorch has better support for batched GPs. priors and the like you may want to do in pyro.

for example, see here:

https://github.com/cornellius-gp/gpytorch/tree/master/examples/09_Pyro_Integration

we are in any case considering integrating pyro.contrib.gp more closely with gpytorch.

fehiepsi commented 5 years ago

I think you can resolve the first question with: kernel.set_prior("lengthscale", dist.Gamma(priors[0], priors[1]).to_event())

For the second question, if you want a multitask GP, you can use the VariationalGP class and set latent_shape = torch.Size([4]).

mileslucas commented 5 years ago

@fehiepsi The first line works great, thank you. I think that would be worth putting in the documentation; I'm happy to do so, but I don't really know the mechanics behind it.

For the second, no dice, yet. Here is my code

# X, y, and the imports are the same as in the dummy-data example above
amplitude = 20 * torch.ones(4)
lengthscale = torch.tensor(np.tile([300., 30., 30.], (4, 1)))

kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)
priors = torch.tensor(
    [[2, 2, 2], 
    [0.0075, 0.075, 0.075]]
)
kernel.set_prior("lengthscale", dist.Gamma(priors[0], priors[1]).to_event())
kernel.set_prior("variance", dist.Uniform(10, 150))

like = gp.likelihoods.Gaussian(variance=torch.tensor(1.))
gpr = gp.models.VariationalGP(X, y, kernel, like, latent_shape=torch.Size([4]))
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = int(1e3)
for i in tqdm.trange(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

This still fails with RuntimeError: The size of tensor a (264) must match the size of tensor b (4) at non-singleton dimension 0.

The error occurs during the calculation X / self.lengthscale.

In the docs I also see that latent_shape defaults to Y.shape[:-1], and my Y has shape (4, Nsamples), so I should be able to omit it entirely.

Edit: the VariationalGP works when my amplitude and lengthscale parameters are not expanded to the batch dimension, i.e.

# works
amp.shape = ()
length_scale.shape = (3,)

# does not work
amp.shape = (4,)
length_scale.shape = (4, 3)
fehiepsi commented 5 years ago

You need to keep length_scale with shape (3,). The variational parameters will learn the correlation between outputs for you.

Another way to do a multitask GP is to use the Coregionalize kernel. You need to expand the input X to have shape (264 x 4, 7), where the 4 extra dimensions are a one-hot encoding of the 4 GPs. The drawback of this approach is that the data is enlarged, so learning will be slow (unless you use sparse models). The advantage is that you can see how each GP correlates with the others by looking at the components and diagonal parameters of the Coregionalize kernel.
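For concreteness, a rough sketch of that Coregionalize setup might look as follows (my own construction, not code from this thread; the one-hot encoding, rank=1, and the Product combination are illustrative assumptions):

import torch
import pyro.contrib.gp as gp

num_tasks = 4
X = torch.randn(264, 3)                  # original inputs
y = torch.randn(num_tasks, 264)          # four outputs

# Repeat the inputs once per task and append a one-hot task indicator,
# giving X_aug shape (264 * 4, 3 + 4) and y_flat shape (264 * 4,).
task_id = torch.arange(num_tasks).repeat_interleave(X.size(0))
one_hot = torch.nn.functional.one_hot(task_id, num_tasks).float()
X_aug = torch.cat([X.repeat(num_tasks, 1), one_hot], dim=-1)
y_flat = y.reshape(-1)

# The data kernel acts on the 3 feature columns, the Coregionalize kernel
# on the 4 one-hot columns; their product gives the multitask covariance.
data_kernel = gp.kernels.RBF(input_dim=3, active_dims=[0, 1, 2])
task_kernel = gp.kernels.Coregionalize(input_dim=4, rank=1, active_dims=[3, 4, 5, 6])
kernel = gp.kernels.Product(data_kernel, task_kernel)

gpr = gp.models.GPRegression(X_aug, y_flat, kernel, noise=torch.tensor(1.))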

If you want to do inference independently for each GP, then please create 4 separate GP models.

For your first question: PyTorch/Pyro distributions have an event_shape and a batch_shape. By default, a distribution like dist.Normal(...) has event_shape equal to torch.Size([]). Calling .to_event() reinterprets the batch dimensions as event dimensions, so your prior gets the correct event_shape.
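As a small illustration (mine, not from the thread) of what .to_event() changes for the Gamma prior above:

import torch
import pyro.distributions as dist

priors = torch.tensor([[2., 2., 2.], [0.0075, 0.075, 0.075]])

d = dist.Gamma(priors[0], priors[1])
print(d.batch_shape, d.event_shape)              # torch.Size([3]) torch.Size([])

d_event = d.to_event()
print(d_event.batch_shape, d_event.event_shape)  # torch.Size([]) torch.Size([3])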

mileslucas commented 5 years ago

To clarify your first point: I need the shape of length_scale to be (3,), and through the variational process I will be given 4 sets of shape-(3,) lengthscales, one for each GP?

fehiepsi commented 5 years ago

It is something like this: we get 4 sets of variational parameters, and each set is learned by conditioning on the corresponding GP output. There is only one kernel (you can create a more complicated kernel if you want).

mileslucas commented 5 years ago

In that case, I have my code working- thanks for the help!

eb8680 commented 5 years ago

To begin I tried logging in with GitHub and also creating an account on the pyro forums, but neither of those is working.

cc @jpchen

jpchen commented 5 years ago

@mileslucas it seems to be hitting the daily throttles on our email client. could you try again in 24 hours and see if you still have this problem? sorry for the inconvenience, we haven't gotten to migrating off our existing backend for the forum.

mileslucas commented 5 years ago

@fehiepsi @martinjankowiak Not reopening, but I wanted to do some follow up.

I've spent a large amount of time over the past few days learning more about GPs and doing a lot of implementations using Pyro, GPyTorch, TensorFlow probability, and gpflow.

I really enjoy the overall probabilistic framework of Pyro the most, and I have found that I much prefer PyTorch to TensorFlow for the work I need to do. I also really like how Pyro and GPyTorch have ARD kernels built in (TFP does not, for some reason).

GPyTorch vs Pyro: I really liked the batching in GPyTorch. Not multitask GPs or GPs with latent shapes, but literally just multi-output, independent GPs. It worked really well in GPyTorch. At the end of the day, I would still prefer Pyro, though, for two reasons:

  1. GPyTorch had A LOT of trouble converging and was about 5 times slower than Pyro's optimization.
  2. I have other probabilistic needs in my code base and will most likely use Pyro for them, so I'd like to minimize my dependencies to just PyTorch and Pyro.

I've seen discussion here and there about incorporating GPyTorch with Pyro, and I'd really like to see the batching features of GPyTorch in Pyro. Almost every other part of PyTorch can use batching, and even a lot of the methods within Pyro can use batching. I think it just makes sense, and I would love to see a push for batched GPs (again, completely independent batches). I'm not super comfortable with the Pyro framework yet, but I could be interested in helping out where I can, even if it's just documentation.

fehiepsi commented 5 years ago

@mileslucas Theoretically, GPyTorch will be faster per inference step for large datasets/mini-batches (>1000 points). I guess you are running your model on small data (or on large data with small mini-batches or few inducing points). I have not used GPyTorch myself, but it has many tuning parameters that control the result (such as the number of conjugate-gradient steps); you could adjust them to get better convergence.

I know there is a lot of room to make Pyro's GP module better. A feature such as batched (independent) GPs requires updating the Kernel classes (e.g. indexing with -1, -2 instead of 0, 1). It is doable, but we don't have much bandwidth (implementing, reviewing, ...) for it. Sorry about that!
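As a rough illustration of the index-from-the-right idea (my sketch, not Pyro code): a squared-distance helper written with dims -1/-2 handles both unbatched (N, D) and batched (B, N, D) inputs with the same code:

import torch

def square_scaled_dist(X, Z, lengthscale):
    # X: (..., N, D), Z: (..., M, D), lengthscale broadcastable to (..., D)
    sX = X / lengthscale
    sZ = Z / lengthscale
    X2 = (sX ** 2).sum(-1, keepdim=True)            # (..., N, 1)
    Z2 = (sZ ** 2).sum(-1, keepdim=True)            # (..., M, 1)
    XZ = sX @ sZ.transpose(-1, -2)                  # (..., N, M)
    return (X2 - 2 * XZ + Z2.transpose(-1, -2)).clamp(min=0)

lengthscale = torch.tensor([300., 30., 30.])
r2_single = square_scaled_dist(torch.randn(264, 3), torch.randn(264, 3), lengthscale)        # (264, 264)
r2_batched = square_scaled_dist(torch.randn(4, 264, 3), torch.randn(4, 264, 3), lengthscale) # (4, 264, 264)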

mileslucas commented 5 years ago

I haven't done a ton of debugging on my GPyTorch models, but the first big red flag was that they always converged to negative lengthscales (with no errors raised in the process, either).

I have a pretty mediocre implementation of a batched Isotropy kernel; perhaps I could get some tips on how best to take it further?

class BatchRBF(gp.kernels.Isotropy):

    def __init__(self, input_dim, variance=None, lengthscale=None, active_dims=None, batch_size=None):
        super().__init__(input_dim, variance, lengthscale, active_dims)

        # Infer the batch size from the leading dimension of the batched
        # lengthscale or variance when it is not given explicitly.
        if batch_size is None:
            if lengthscale is not None and lengthscale.dim() > 1:
                batch_size = lengthscale.shape[-2]
            elif variance is not None and variance.dim() > 0:
                batch_size = variance.shape[-1]
            else:
                batch_size = 1
        self.batch_size = torch.Size([batch_size])
        # variance and lengthscale are already registered as parameters by
        # Isotropy's __init__, so they are not re-assigned here.

    def forward(self, X, Z=None, diag=False):
        if diag:
            return self._diag(X)

        r2 = self._square_scaled_dist(X, Z)
        a2 = self.variance[..., None, None]
        s  = torch.exp(-0.5 * r2)
        return  a2 * s

    def _slice_input(self, X):
        if X.dim() >= 2:
            return X[..., self.active_dims]
        elif X.dim() == 1:
            return X.unsqueeze(1)
        else:
            raise ValueError("Input X must be either 1 or 2 dimensional.")

    def _square_scaled_dist(self, X, Z=None):
        r"""
        Returns :math:`\|\frac{X-Z}{l}\|^2`.
        """
        if Z is None:
            Z = X
        X = self._slice_input(X)
        Z = self._slice_input(Z)
        if X.size(1) != Z.size(1):
            raise ValueError("Inputs must have the same number of features.")

        r2 = torch.empty(*self.batch_size, X.shape[-2], X.shape[-2])

        for i in range(self.batch_size[-1]):
            scaled_X = X / self.lengthscale[i]
            scaled_Z = Z / self.lengthscale[i]
            X2 = (scaled_X ** 2).sum(1, keepdim=True)
            Z2 = (scaled_Z ** 2).sum(1, keepdim=True)
            XZ = scaled_X.matmul(scaled_Z.t())
            r2[i] = X2 - 2 * XZ + Z2.t()
        return r2.clamp(min=0)
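A hypothetical quick check of the class above might look like this (not part of the original comment; shapes taken from the dummy data earlier in the thread, and it assumes the imports used there):

# Relies on BatchRBF and the torch / gp imports defined above.
kernel = BatchRBF(input_dim=3,
                  variance=20 * torch.ones(4),
                  lengthscale=torch.tensor([300., 30., 30.]).repeat(4, 1))
K = kernel(torch.randn(264, 3))
print(K.shape)  # expected: torch.Size([4, 264, 264])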
fehiepsi commented 5 years ago

If you have tested that this class works as expected, then I think the only remaining piece is to update the gp.util.conditional function (used for prediction). Most operators already support batching (modulo getting the transposes right). It is good that PyTorch has a batched Cholesky as of v1.0. You only need a batched version of the triangular solve, which can be found here: https://pytorch.org/docs/stable/_modules/torch/distributions/multivariate_normal.html.
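For reference, a minimal sketch of the batched linear algebra being described, written with the current torch.linalg names rather than the internal helpers linked above (an assumption on my part; the thread predates this API):

import torch

B, N = 4, 264
A = torch.randn(B, N, N)
K = A @ A.transpose(-1, -2) + 1e-3 * torch.eye(N)      # a batch of SPD kernel matrices
y = torch.randn(B, N, 1)

L = torch.linalg.cholesky(K)                           # batched Cholesky, shape (B, N, N)
v = torch.linalg.solve_triangular(L, y, upper=False)   # batched lower-triangular solve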

mileslucas commented 5 years ago

So I've tried fiddling around, and I just don't have the skill or knowledge currently to implement this meaningfully. I would really enjoy seeing this kind of feature; it's currently a last point of pain for a nice implementation of a specialized likelihood method. I can't really get GPyTorch working for this either; the disconnect between the likelihoods, mlls, kernels, and models is too much to try to incorporate into my own subclass of Parameterized.

fehiepsi commented 5 years ago

@mileslucas This will be addressed in the future. In the meantime, I will make a simple version for you (sometime next week).

fehiepsi commented 5 years ago

As discussed in the forum with @mileslucas, it seems that supporting batched Gaussian processes is not necessary, because we can achieve the same result without batching. And although supporting batching would mostly amount to adding an ellipsis at the front of each variable in the GP model, we would also need to introduce an additional batch_shape argument and maintain its logic across the models and kernels. That would make the module more complicated and harder to adapt/modify. Let's keep the module minimal and fun to play with instead. :)

For now, I'll close this issue. I'll reconsider implementing this if more requests come in.

JunHanSnap commented 4 years ago

I am asking for batched Gaussian processes, as I need a GP model for each sample in a batch of data. Looking forward to seeing this new feature.

fehiepsi commented 4 years ago

@JunHanSnap gpytorch supports batched GPs. Could you give it a try? :)

JunHanSnap commented 4 years ago

Thanks! Yeah, I noticed it and am using it.