nrontsis / PILCO

Bayesian Reinforcement Learning in Tensorflow
MIT License

Added 3 more gym environment examples. Small changes to pilco.py, mgp… #23

Closed kyr-pol closed 5 years ago

kyr-pol commented 5 years ago

…r.py and additions to rewards.py, explained further in the pull request.

Added 3 extra tasks:

For the swing-up task, I modified the gym environment's initial conditions, setting the pendulum in the bottom position with zero velocity. PILCO in general needs a specific starting state to plan from successfully.
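A minimal sketch of how such an initial-condition change could look, assuming classic Gym's Pendulum-v0 internals (the `state` attribute and the `_get_obs` method); the exact fields may differ between Gym versions:

```python
import numpy as np
import gym


class FixedInitPendulum(gym.Wrapper):
    """Start Pendulum-v0 hanging down with zero velocity on every reset."""

    def reset(self):
        self.env.reset()
        # PendulumEnv keeps its state as (theta, theta_dot); theta = pi is the
        # bottom position. Overwrite it after the default (random) reset.
        self.env.unwrapped.state = np.array([np.pi, 0.0])
        return self.env.unwrapped._get_obs()


env = FixedInitPendulum(gym.make("Pendulum-v0"))
```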

For the double pendulum task, a wrapper is used that terminates the episode when the pendulum reaches the limits of its state space, since hitting the limits creates non-smooth behaviour that is hard for PILCO to model. Additionally, angles in radians are computed from the sin/cos representation, reducing the state-space dimensionality (think of this as a much simpler version of the state augmentation the original PILCO uses).
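As an illustration only (the observation layout and the angle threshold below are assumptions, not the repo's exact wrapper), the two ideas combined might look like this:

```python
import numpy as np
import gym


class DoublePendulumWrapper(gym.Wrapper):
    """Early termination near the state limits plus sin/cos -> angle reduction."""

    ANGLE_LIMIT = 0.5  # hypothetical threshold, in radians

    def _reduce(self, obs):
        # Assumed layout: [x, sin(th1), sin(th2), cos(th1), cos(th2), velocities...]
        th1 = np.arctan2(obs[1], obs[3])
        th2 = np.arctan2(obs[2], obs[4])
        return np.hstack([obs[0], th1, th2, obs[5:]])

    def reset(self):
        return self._reduce(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = self._reduce(obs)
        # Stop the episode before the dynamics become non-smooth near the limits.
        if abs(obs[1]) > self.ANGLE_LIMIT or abs(obs[2]) > self.ANGLE_LIMIT:
            done = True
        return obs, reward, done, info
```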

For the swimmer, a wrapper is also used that augments the state space with one extra state: the accumulated reward. In the original gym version the reward function uses a hidden state, which violates PILCO's assumptions. No hidden information is accessed by PILCO; the formulation is just made compatible with its assumptions. Furthermore, I added a composite reward function that includes penalties for driving the robot's joints to their angle limits, again in order to maintain smooth behaviour that is easy for the GP to model.
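A rough sketch of the state-augmenting idea (the wrapper name is hypothetical and the composite reward is omitted):

```python
import numpy as np
import gym


class SwimmerRewardWrapper(gym.Wrapper):
    """Append the accumulated reward as an extra, observable state dimension."""

    def reset(self):
        self.total_reward = 0.0
        return np.hstack([self.env.reset(), self.total_reward])

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.total_reward += reward
        return np.hstack([obs, self.total_reward]), reward, done, info
```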

On another note, I fixed (held constant) the noise variance in some of the runs, which helps conditioning, and I also added a fairly uninformative prior on the lengthscales and variances, just to penalise the extreme values that otherwise occur in the higher-dimensional tasks (this is something the original PILCO does too).
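For reference, something along these lines, assuming GPflow 1.x (which the repo used at the time); the parameter paths (`pilco.mgpr.models`, `kern.lengthscales`, etc.) and the Gamma hyperparameters are assumptions, not the exact values in the PR:

```python
import gpflow

for model in pilco.mgpr.models:
    # Weakly informative priors to penalise extreme hyperparameter values.
    model.kern.lengthscales.prior = gpflow.priors.Gamma(1.0, 10.0)
    model.kern.variance.prior = gpflow.priors.Gamma(1.5, 2.0)
    # Fix the observation noise to a small value to help conditioning.
    model.likelihood.variance = 0.01
    model.likelihood.variance.trainable = False
```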

kyr-pol commented 5 years ago

I think we should add an option in the PILCO constructor for priors, because they have to be defined before the model is compiled (afaik), and for the moment I have hard-coded them (they are general enough that they probably help with all environments, but it is still not best practice).
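To make the idea concrete, a purely hypothetical interface (not the current constructor) could look like:

```python
# Hypothetical: pass priors at construction time, before the model is compiled.
pilco = PILCO(X, Y, controller=controller, horizon=40,
              priors={"lengthscales": gpflow.priors.Gamma(1.0, 10.0),
                      "variance": gpflow.priors.Gamma(1.5, 2.0)})
```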

codecov-io commented 5 years ago

Codecov Report

Merging #23 into master will decrease coverage by 4.81%. The diff coverage is 47.22%.


@@            Coverage Diff            @@
##           master     #23      +/-   ##
=========================================
- Coverage   95.12%   90.3%   -4.82%     
=========================================
  Files           7       7              
  Lines         328     361      +33     
=========================================
+ Hits          312     326      +14     
- Misses         16      35      +19
| Impacted Files | Coverage Δ | |
|---|---|---|
| pilco/models/pilco.py | 93.33% <100%> (+0.39%) | :arrow_up: |
| pilco/models/mgpr.py | 100% <100%> (ø) | :arrow_up: |
| pilco/rewards.py | 61.11% <26.92%> (-32%) | :arrow_down: |


kyr-pol commented 5 years ago

Possibly the extra reward functions etc., if we think they are environment-specific, can be kept in the swimmer.py file.

Also, regarding the slight change in the policy optimisation function: I don't think we always have to cold-start the optimisation by randomising; we can run it once using the last values as initialisation, and then randomise.
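A minimal sketch of that warm-start-then-restart logic, written against generic callables rather than the repo's exact API:

```python
def optimise_with_restarts(optimise, evaluate, get_params, set_params, randomise,
                           restarts=2):
    """Optimise once from the current parameters, then try random restarts."""
    optimise()                                    # warm start from the last solution
    best_params, best_value = get_params(), evaluate()
    for _ in range(restarts):
        randomise()                               # cold start with random parameters
        optimise()
        if evaluate() > best_value:
            best_params, best_value = get_params(), evaluate()
    set_params(best_params)                       # keep the best policy found
    return best_value
```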

nrontsis commented 5 years ago

Amazing work; I will work on it later this week.

I definitely agree about the priors; an easy-to-use interface might be a great selling point for our implementation.

Furthermore, I think that we should:

After this is done, we could include the environments in unit tests, requiring any new version of the library to solve them. This would allow automated testing of new ideas without having to manually verify them on real-world examples.
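As a sketch of what such a test could look like (the `run_pilco_on` helper and the return thresholds are hypothetical):

```python
import pytest


@pytest.mark.slow
@pytest.mark.parametrize("env_name, target_return", [
    ("InvertedPendulum-v2", 950.0),
    ("InvertedDoublePendulum-v2", 8000.0),
])
def test_pilco_solves_environment(env_name, target_return):
    # run_pilco_on would train PILCO on the environment and report the final return.
    final_return = run_pilco_on(env_name)
    assert final_return >= target_return
```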

kyr-pol commented 5 years ago

Did some work on these two points; check the added notebook. It's in progress, but what do you think of a structure more or less like that? I thought it'd be helpful for users getting started who are stuck at a task with PILCO running but not seemingly learning.

We could also add, at the end, information more specific to what we did in the included examples.