nrontsis / PILCO

Bayesian Reinforcement Learning in Tensorflow
MIT License

Added 3 more gym environment examples. Small changes to pilco.py, mgp… #23

Closed kyr-pol closed 5 years ago

kyr-pol commented 5 years ago

…r.py and additions to rewards.py, explained further in the pull request.

Added 3 extra tasks:

For the swing-up task, I modified the gym environment's initial conditions, setting the pendulum in the bottom position with zero velocity. PILCO in general needs a specific starting state to plan from successfully.
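A minimal sketch of how such an initial-condition change could look, assuming classic Gym's Pendulum-v0 internals (the `state` attribute and the `_get_obs` method); the exact fields may differ between Gym versions:

```python
import numpy as np
import gym


class FixedInitPendulum(gym.Wrapper):
    """Start Pendulum-v0 hanging down with zero velocity on every reset."""

    def reset(self):
        self.env.reset()
        # PendulumEnv keeps its state as (theta, theta_dot); theta = pi is the
        # bottom position. Overwrite it after the default (random) reset.
        self.env.unwrapped.state = np.array([np.pi, 0.0])
        return self.env.unwrapped._get_obs()


env = FixedInitPendulum(gym.make("Pendulum-v0"))
```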

For the double pendulum task, a wrapper is used that terminates the episode when the pendulum reaches the limits of its state space, since hitting the limits creates non-smooth behaviour that is hard for PILCO to model. Additionally, angles in radians are computed from the sin/cos representation, reducing the state-space dimensionality (think of this as a much simpler version of the state augmentation the original PILCO uses).
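As an illustration only (the observation layout and the angle threshold below are assumptions, not the repo's exact wrapper), the two ideas combined might look like this:

```python
import numpy as np
import gym


class DoublePendulumWrapper(gym.Wrapper):
    """Early termination near the state limits plus sin/cos -> angle reduction."""

    ANGLE_LIMIT = 0.5  # hypothetical threshold, in radians

    def _reduce(self, obs):
        # Assumed layout: [x, sin(th1), sin(th2), cos(th1), cos(th2), velocities...]
        th1 = np.arctan2(obs[1], obs[3])
        th2 = np.arctan2(obs[2], obs[4])
        return np.hstack([obs[0], th1, th2, obs[5:]])

    def reset(self):
        return self._reduce(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = self._reduce(obs)
        # Stop the episode before the dynamics become non-smooth near the limits.
        if abs(obs[1]) > self.ANGLE_LIMIT or abs(obs[2]) > self.ANGLE_LIMIT:
            done = True
        return obs, reward, done, info
```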

For the swimmer, a wrapper is also used that augments the state space with one extra state: the accumulated reward. In the original gym version the reward function uses a hidden state, which violates PILCO's assumptions. No hidden information is accessed by PILCO; the formulation is just made compatible with its assumptions. Furthermore, I added a composite reward function that includes penalties for driving the robot's joints to their angle limits, again in order to maintain smooth behaviour that is easy for the GP to model.
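A rough sketch of the state-augmenting idea (the wrapper name is hypothetical and the composite reward is omitted):

```python
import numpy as np
import gym


class SwimmerRewardWrapper(gym.Wrapper):
    """Append the accumulated reward as an extra, observable state dimension."""

    def reset(self):
        self.total_reward = 0.0
        return np.hstack([self.env.reset(), self.total_reward])

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.total_reward += reward
        return np.hstack([obs, self.total_reward]), reward, done, info
```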

On another note, I fixed (held constant) the noise variance in some of the runs, which helps conditioning, and I also added a fairly uninformative prior on the lengthscales and variances, just to penalise the extreme values that otherwise occur in the higher-dimensional tasks (this is something the original PILCO does too).
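For reference, something along these lines, assuming GPflow 1.x (which the repo used at the time); the parameter paths (`pilco.mgpr.models`, `kern.lengthscales`, etc.) and the Gamma hyperparameters are assumptions, not the exact values in the PR:

```python
import gpflow

for model in pilco.mgpr.models:
    # Weakly informative priors to penalise extreme hyperparameter values.
    model.kern.lengthscales.prior = gpflow.priors.Gamma(1.0, 10.0)
    model.kern.variance.prior = gpflow.priors.Gamma(1.5, 2.0)
    # Fix the observation noise to a small value to help conditioning.
    model.likelihood.variance = 0.01
    model.likelihood.variance.trainable = False
```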

kyr-pol commented 5 years ago

I think we should add an option in the PILCO constructor for priors, because they have to be defined before the model is compiled (afaik), and for the moment I have hard-coded them (they are general enough that they probably help with all environments, but it is still not best practice).
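To make the idea concrete, a purely hypothetical interface (not the current constructor) could look like:

```python
# Hypothetical: pass priors at construction time, before the model is compiled.
pilco = PILCO(X, Y, controller=controller, horizon=40,
              priors={"lengthscales": gpflow.priors.Gamma(1.0, 10.0),
                      "variance": gpflow.priors.Gamma(1.5, 2.0)})
```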

codecov-io commented 5 years ago

Codecov Report

Merging #23 into master will decrease coverage by 4.81%. The diff coverage is 47.22%.


@@            Coverage Diff            @@
##           master     #23      +/-   ##
=========================================
- Coverage   95.12%   90.3%   -4.82%     
=========================================
  Files           7       7              
  Lines         328     361      +33     
=========================================
+ Hits          312     326      +14     
- Misses         16      35      +19
| Impacted Files | Coverage Δ | |
|---|---|---|
| pilco/models/pilco.py | 93.33% <100%> (+0.39%) | :arrow_up: |
| pilco/models/mgpr.py | 100% <100%> (ø) | :arrow_up: |
| pilco/rewards.py | 61.11% <26.92%> (-32%) | :arrow_down: |


kyr-pol commented 5 years ago

Possibly the extra reward functions etc., if we think they are environment-specific, can be kept in the swimmer.py file.

Also, regarding the slight change in the policy optimisation function: I don't think we always have to cold-start the optimisation by randomising; we can run it once using the last values as initialisation, and then randomise.
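A minimal sketch of that warm-start-then-restart logic, written against generic callables rather than the repo's exact API:

```python
def optimise_with_restarts(optimise, evaluate, get_params, set_params, randomise,
                           restarts=2):
    """Optimise once from the current parameters, then try random restarts."""
    optimise()                                    # warm start from the last solution
    best_params, best_value = get_params(), evaluate()
    for _ in range(restarts):
        randomise()                               # cold start with random parameters
        optimise()
        if evaluate() > best_value:
            best_params, best_value = get_params(), evaluate()
    set_params(best_params)                       # keep the best policy found
    return best_value
```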

nrontsis commented 5 years ago

Amazing work; I will work on it later this week.

I definitely agree about the priors; an easy-to-use interface might be a great selling point for our implementation.

Furthermore, I think that we should:

After this is done, we could include the environments in unit tests, requiring any new version of the library to solve them. This would allow automated testing of new ideas without having to manually verify them on real-world examples.
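As a sketch of what such a test could look like (the `run_pilco_on` helper and the return thresholds are hypothetical):

```python
import pytest


@pytest.mark.slow
@pytest.mark.parametrize("env_name, target_return", [
    ("InvertedPendulum-v2", 950.0),
    ("InvertedDoublePendulum-v2", 8000.0),
])
def test_pilco_solves_environment(env_name, target_return):
    # run_pilco_on would train PILCO on the environment and report the final return.
    final_return = run_pilco_on(env_name)
    assert final_return >= target_return
```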

kyr-pol commented 5 years ago

Did some work on these two points; check the added notebook. It's in progress, but what do you think of a structure more or less like that? I thought it'd be helpful for users getting started who are stuck at a task with PILCO running but not seemingly learning.

We could also add, at the end, information more specific to what we did in the included examples.