Implement `predict_on_noisy_input` with MCMC in gpytorch

jaztsong commented 4 years ago

I'm recently starting to re-implement PILCO in pytorch for better intergration with my other works. To leverage the fast prediction (KISS-GP) in gpytorch, I decided to use MCMC sampling approach to implement the core function in mgpr.py - predict_on_noisy_input which use moment matching based on the original paper.

However, the result I got from sampling is dramatically different from moment matching. I wonder if anyone can help me identify the problem. The following code shows both optimize and predict_on_noisy_input.

    def optimize(self,restarts=1, training_iter = 200):
        self.likelihood.train()
        self.model.train()

        # Use the adam optimizer
        optimizer = torch.optim.Adam([
            {'params': self.model.parameters()},  # Includes GaussianLikelihood parameters
            ], lr=self.lr)
        # "Loss" for GPs - the marginal log likelihood
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(self.likelihood, self.model)
        for i in range(training_iter):
            # Zero gradients from previous iteration
            optimizer.zero_grad()
            # Output from model
            output = self.model(self.X)
             # Calc loss and backprop gradients
            loss = -mll(output, self.Y).sum()
            loss.backward()
            print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
            optimizer.step()

    def predict_on_noisy_inputs(self, m, s, num_samps=500):
        """
        Approximate GP regression at noisy inputs via moment matching
        IN: mean (m) (row vector) and (s) variance of the state
        OUT: mean (M) (row vector), variance (S) of the action
             and inv(s)*input-ouputcovariance

        We adopt the sampling approach by leveraging the power of GPU
        """
        assert(m.shape[1] == self.num_dims and s.shape == (self.num_dims,self.num_dims))
        self.likelihood.eval()
        self.model.eval()

        if self.cuda == True:
            m = torch.tensor(m).float().cuda()
            s = torch.tensor(s).float().cuda()
            inv_s = torch.inverse(s)

        sample_model = torch.distributions.MultivariateNormal(m,s)
        pred_inputs = sample_model.sample((num_samps,)).float()
        pred_inputs[pred_inputs != pred_inputs] = 0
        pred_inputs,_ = torch.sort(pred_inputs,dim=0)
        pred_inputs = pred_inputs.reshape(num_samps,self.num_dims).repeat(self.num_outputs,1,1)

        #centralize X ?
        # self.model.set_train_data(self.centralized_input(m),self.Y)
        with torch.no_grad(), gpytorch.settings.fast_pred_var():
            pred_outputs = self.model(pred_inputs)

        #Calculate mean, variance and inv(s)* input-output covariance
        M = torch.mean(pred_outputs.mean,1)[None,:]
        V_ = torch.cat((pred_inputs[0].t(),pred_outputs.mean),0)
        fact = 1.0 / (V_.size(1) - 1)
        V_ -= torch.mean(V_, dim=1, keepdim=True)
        V_t = V_.t()  # if complex: mt = m.t().conj()
        covs =  fact * V_.matmul(V_t).squeeze()
        V = covs[0:self.num_dims,self.num_dims:]
        V = inv_s @ V
        S = covs[self.num_dims:,self.num_dims:]

        return M, S, V

kyr-pol commented 4 years ago

Hi @jaztsong ,

I haven't worked with GPytorch or pytorch, so I wouldn't know if some specific detail is wrong with your implementation, but the logic seems ok. A more general comment: I think what you are doing here is the unscented transform, which is a reasonable alternative to moment matching, but it might make policy optimisation harder (in our version for the gradients of the cost w.r.t. the controller's parameters we back propagate through the moment matching equations).

Now for your problem, how did you compare the sampling results to moment matching? Did you rewrite moment matching in pytorch? If so, did you test against our version? Otherwise, if you are comparing to our implementation of predict_on_noisy_input show us a MWE, which in this case I guess should have a small dataset and a GP model defined both in gpflow and gpytorch with the same hyperparameters. Then we should verify that on normal inputs (single points, not probabilistic) their predictions are the same and after that we'd look for a given noisy test point whether the predicted mean, variance and input-output covariance are similar (preferably with a large number of samples just to be sure).

jaztsong commented 4 years ago

@kyr-pol Thanks for the reponse.

I'm only using mgpr for dynamics model, not for controller so far. So the gradients of controller's parameters doesn't travel through mgpr.

Currently, the way I'm testing with tests/test_prediction.py by comparing the sampling result against the matlab code. The mean is not too bad with the relative error around 10~30%. But the variance and input-output covariance can be way off the mark.

kyr-pol commented 4 years ago

What's your GP kernel's hyperparameters and what's the covariance on the initial state? For very small covariances the estimates with both methods should converge to the predicted values for mean and variance from the standard GP equations. If one or both of them don't, then you know what's wrong. If they do, but when the initial covariance increases the estimates diverge (and increasing the number of samples doesn't help) it could be just moment matching failing (after all it is an approximation).

jaztsong commented 4 years ago

Thanks. I will run more experiments to compare the outputs.

One more issue I need to consult your wisdom is that I had zero success to get a good policy in the examples (inv_double_pendulum.py and inverted_pendulum.py) in this repo (using all your code).

The only change I made is to replace mojoco-based environment with roboschool-based environment. They are supposed to be identical to each other.

Any thoughts?

kyr-pol commented 4 years ago

So you are using the code from the examples folder, just with roboschool environments instead of gym? If so, I would expect it to be a matter of different representation of states, different scaling etc., so you would have to set a different initial state, goal state and so on. Maybe the notebook in the examples folder can help: Hyperparameter setting and troubleshooting tips.ipynb. Good luck!

jaztsong commented 4 years ago

@kyr-pol Thank you for your suggestion. I did check the state definitions of roboschool vs gym, it looks like they are meant to be identical. But I don't know why it didn't work. I will definitely go check the hyperparameter settings.

An addition quick question is that, if I use a sampling approach, how does the backpropagation flow? Does it consider the sampled points as constant tensors?

kyr-pol commented 4 years ago

Hey, not sure how would pytorch, or tensorflow for that matter, implement backpropagation with sampling, but yes I'd expect the samples to be constant, after all they don't change and are not optimised. I recently came across a paper that uses numerical quadrature for propagating uncertainty in PILCO and the authors calculate gradients too: Numerical Quadrature for Probabilistic Policy Search. I hope that helps!

nrontsis / PILCO

Implement `predict_on_noisy_input` with MCMC in gpytorch #33