So I did some experimenting with different gym environments, mostly the continuous mountain car ('MountainCarContinuous-v0'). I created a new branch, 'more_envs', to work on this and other environments. Things I tried so far:
Increasing the car's power, allowing it to climb the hill without momentum -> this works, but it makes the task trivial.
Subsampling and longer horizons. To subsample from the environment, each action is repeated a fixed number of times and the intermediate states are discarded (see the wrapper sketch after this list). Experimented with repeat factors of 1 (no subsampling), 2, 4 and 5.
Different controller initialisation. Experimented with larger and smaller variances in the initialisation of the controllers, without significant effect.
Restarting the model. Retraining the model and the controller from scratch, but using the data collected so far, to help the optimisation process restart. No significant improvement, at least with the few restarts (fewer than 5) tried.
Tuning the weight matrix of the reward function. The values of the 'W' matrix control how steeply the reward decreases as the state moves away from the goal in each direction; the reward is roughly exp(-(x - target)' W (x - target) / 2), so smaller weights mean slower reward decay.
Fixing the model uncertainty (kernel variance). Tested with the variance fixed at 1, 0.1 and 0.02.
RBF controller with more basis functions, hoping that in a higher-dimensional space the optimisation will have fewer issues with local minima. Tried 10, 15 and 20.
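For reference, here's a minimal sketch of what I mean by subsampling, as a plain gym wrapper (the class name and the `repeat` argument are just placeholders I made up, not something that exists in the repo):

```python
import gym


class ActionRepeat(gym.Wrapper):
    """Repeat each action `repeat` times and discard the intermediate states."""

    def __init__(self, env, repeat=4):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            # Summing the intermediate rewards is one choice; they could also be discarded.
            total_reward += reward
            if done:
                break
        # Only the last observation is returned, so one step of the wrapped
        # environment corresponds to `repeat` steps of the underlying one.
        return obs, total_reward, done, info


env = ActionRepeat(gym.make('MountainCarContinuous-v0'), repeat=4)
```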
In most cases, the algorithm gets stuck on the greedy behaviour of pushing right as hard as possible (action = 1 throughout the episode), or on some other more or less constant action that pushes to the right (0 < action < 1).
Possible additions: Diagnostics: we could add plots comparing the predicted trajectory, given an initial state and a controller, to the actual trajectory when the policy is implemented, or the predicted reward to the actual reward (a rough sketch of this is after this list). I don't think the model is the issue in this case, but it'd be good to know for sure, and this is functionality that the original implementation has.
Restarts: the optimisation is local, so it might benefit from random restarts, for example when no progress has been made for several consecutive steps (this changes the algorithm though, since the original PILCO has no such restarts, even though other similar works in the literature do).
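Something like the following rough sketch is what I have in mind for the trajectory comparison, assuming we expose a `policy(state) -> action` callable and a one-step mean prediction `model_step(state, action) -> next_state` from the learned dynamics model (both names are placeholders, not the current API):

```python
import numpy as np
import matplotlib.pyplot as plt


def compare_trajectories(env, policy, model_step, horizon=50):
    # Actual trajectory: the controller acts on the real environment.
    x = env.reset()
    real = [x]
    for _ in range(horizon):
        x, _, done, _ = env.step(policy(x))
        real.append(x)
        if done:
            break

    # Predicted trajectory: starting from the same initial state, the controller
    # acts on the model's own one-step predictions.
    x = real[0]
    pred = [x]
    for _ in range(len(real) - 1):
        x = model_step(x, policy(x))
        pred.append(x)

    real, pred = np.array(real), np.array(pred)
    for d in range(real.shape[1]):
        plt.plot(real[:, d], label=f'actual, dim {d}')
        plt.plot(pred[:, d], '--', label=f'predicted, dim {d}')
    plt.xlabel('time step')
    plt.legend()
    plt.show()
    return real, pred
```

The same loop could accumulate the predicted and actual rewards to produce the reward comparison as well.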
As we said, we can connect the mountain car to the original Matlab implementation of PILCO and see if it's solved properly. If it is, we'd know for sure there is something crucial missing in our implementation; if not, it might still be a matter of tuning parameters. Meanwhile I might give some other scenario a shot, maybe the algorithm is just not a good match for this particular environment.
It sounds like a complicated task with many variables to tune.
I can write (probably this week) the connection between Matlab and gym to test PILCO's original implementation. However, I won't have time to tune it (adjust gains for controllers, etc.) before the end of October. @kyr-pol would you be up for doing that?
Yes, thanks, that would be helpful and I can work on the testing and tuning afterwards if necessary.
I have done some more experiments though, working on the pendulum swing up task as well as the mountain car, and it seems that the model gets quite inaccurate over longer time horizons, more so than the Matlab version on similar scenarios. I'll give some examples below.
Some more observations.
Mountain Car: the model inaccuracy on longer-term predictions persists, even with more data points, and also when, through subsampling, every step is long enough that an episode consists of only a few steps. For example (edc6d81), with a subsampling rate of 20, every rollout has just 5 time steps. After collecting 260 data points:

[prediction plots: x_pred, s_pred vs. X_new]

where x_pred, s_pred are the predictions for the position of the car and X_new are the real values. The task here is to move the car from -0.5 to +0.45, so the differences are quite big.
Pendulum (swing-up): firstly, this environment causes crashes, probably similar to the ones mentioned in https://github.com/nrontsis/PILCO/issues/7#issuecomment-412454494. In this case, the 3rd GP, which predicts the angular velocity, fails to learn after the initial random rollout and causes numerical errors in the controller optimisation. This doesn't occur when we subsample, or when we fix the noise to 1e-4 instead of the minimum of 1e-6 where it otherwise ends up.
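For reference, the fix amounts to something like this (a minimal sketch on a toy GPR, assuming "the noise" is the Gaussian likelihood variance of the per-output GPs; written in GPflow 2.x syntax for illustration, with the GPflow version the repo uses it would be assigning `model.likelihood.variance` and switching off its trainable flag):

```python
import numpy as np
import gpflow

# Toy stand-in for one of the per-output dynamics GPs (e.g. the angular-velocity one);
# in practice this would be the corresponding model inside the dynamics model object.
X = np.random.randn(20, 4)
Y = np.random.randn(20, 1)
model = gpflow.models.GPR(
    (X, Y), kernel=gpflow.kernels.SquaredExponential(lengthscales=np.ones(4))
)

# Fix the observation noise at 1e-4 instead of letting it collapse to the 1e-6 floor,
# and exclude it from hyperparameter optimisation.
model.likelihood.variance.assign(1e-4)
gpflow.set_trainable(model.likelihood.variance, False)
```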
There is a similar scenario implemented in Matlab, so I used most of its hyperparameter settings. With a subsampling rate of 3 and 30 timesteps per episode, after 240 data points (876faacd):

[prediction error plot]

where the dimension plotted corresponds to the cosine of the pendulum's angle. Whereas from Matlab, with 40-timestep episodes, after training on 120 data points the error is:

[prediction error plot]

where here the dimension corresponds to the angle itself, going from 0 to pi in a successful run.
One difference between the two models is that in the Matlab version the model takes cos, sin, angular velocity and the control input, and predicts the angular velocity and the angle, while ours goes from sin, cos, angular velocity and control to sin, cos and angular velocity. It doesn't look like this should be that important though.
Another possibility, mentioned by the original author, is the integrator used in the forward dynamics (Euler for gym, dopri integrator in Matlab).
Okay, thanks for the detailed results.
Another possibility might be differences in the training of the GPs? Matlab's implementation has a special optimiser that penalises extreme lengthscales and SNR values (see hypCurb.m).
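If we wanted to try something in that spirit on our side, it would roughly mean adding penalty terms to the GP training objective. A minimal sketch of the idea (GPflow 2.x syntax for illustration; the caps and the exponent are illustrative, not the values hypCurb.m actually uses):

```python
import numpy as np
import tensorflow as tf
import gpflow

# Toy data standing in for one output dimension of the dynamics model.
X = np.random.randn(50, 4)
Y = np.random.randn(50, 1)
model = gpflow.models.GPR(
    (X, Y), kernel=gpflow.kernels.SquaredExponential(lengthscales=np.ones(4))
)


def penalised_loss(max_lengthscale=100.0, max_snr=500.0, p=30.0):
    # Usual negative log marginal likelihood ...
    loss = -model.log_marginal_likelihood()
    # ... plus steep penalties that only kick in once the lengthscales or the
    # signal-to-noise ratio exceed the chosen caps.
    ls = tf.convert_to_tensor(model.kernel.lengthscales)
    signal_var = tf.convert_to_tensor(model.kernel.variance)
    noise_var = tf.convert_to_tensor(model.likelihood.variance)
    loss += tf.reduce_sum((ls / max_lengthscale) ** p)
    loss += (tf.sqrt(signal_var / noise_var) / max_snr) ** p
    return loss


gpflow.optimizers.Scipy().minimize(penalised_loss, model.trainable_variables)
```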
Either way it seems complicated. I now think even more that the best way is to link gym with MATLAB's PILCO and see the differences there.
Potentially acrobot, as per Deisenroth's suggestion, could be another scenario to try.