nrontsis / PILCO

Bayesian Reinforcement Learning in Tensorflow
MIT License

Extra control dimension for varying target values #34

Closed: ManuelM95 closed this issue 4 years ago

ManuelM95 commented 4 years ago

Hey, I'm a student from TUM using your PILCO implementation. I want to optimize the controller for various target values depending on an input target state. I'm planning to add the difference between the target state and the current value (of one of the states) as an extra control dimension. As a result, the model would depend only on the states, but the controller would depend on the states plus e.g. the difference x1_target - x1.
By setting the target value of the extra control dimension to zero and putting in data with different target states, I should be able to optimize the controller for different inputs. Do you know an easy way to do this, or something similar that accounts for different targets, e.g. a vehicle controller where you can set different curvatures?
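A minimal sketch of the augmentation described here might look like the following; `augment_state` and the state index are illustrative names, not part of the PILCO code base, and the GP dynamics model would still see only the raw state.

```python
import numpy as np

def augment_state(x, x1_target, idx=0):
    # Append the tracking error x1_target - x[idx] as an extra input to the
    # controller; the dynamics model keeps using the raw state x unchanged.
    return np.hstack([x, x1_target - x[idx]])

# Example: a 3-dimensional state and a target of 1.5 for the first state
x = np.array([0.2, -0.1, 0.05])
x_ctrl = augment_state(x, x1_target=1.5)   # last entry is the difference, 1.3
```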

Thanks for the help, Manuel

kyr-pol commented 4 years ago

Hi @ManuelM95 ,

This is a good question. Addressing it will take some additions to the code base that, in my opinion, would be good for the project in general.

A simple first approach would be to train multiple controllers, one for each distinct task or subtask, depending on how you want to structure it. This doesn't solve your case, but it might be a helpful step.

A similar piece of functionality in the original PILCO implementation allows for multiple starting states, which induce different predicted trajectories, and a single control policy is trained jointly over all of them.

To have a single policy for the distinct targets, you'd have to alter the training process in a similar way. The training is based on predicted trajectories, and the predictions are Gaussian. The extra dimension you want to introduce would have an arbitrarily large initial variance if you want to alter the targets freely. Then the Gaussian estimate for the next state(s) would also be very uncertain, and planning would be very hard. I think the best approach would be to train on a number of distinct trajectories, corresponding to different targets. If these are reasonably representative of the possible targets, the policy trained on all of them should be able to generalise to new targets too.

To be more specific, one way to implement this, assuming the GP model from mgpr.py remains unchanged, would be to set up one predicted rollout per target and optimise the policy on their summed expected return.
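A hedged sketch of that idea follows; all names are illustrative rather than taken from this repo, and the loop is a deterministic simplification of the Gaussian rollout prediction PILCO actually uses. The controller receives the target difference as an extra input, the reward drives that difference to zero, and a single policy is scored on the summed return over several sampled targets; the same structure covers the multiple-starting-states case mentioned above.

```python
import numpy as np

def rollout_return(policy, dynamics, reward, x0, target, horizon):
    # One predicted trajectory for a fixed target. The controller sees the
    # augmented state [x, target - x[0]]; the dynamics model sees only x.
    x = np.asarray(x0, dtype=float)
    ret = 0.0
    for _ in range(horizon):
        u = policy(np.append(x, target - x[0]))
        x = dynamics(x, u)
        ret += reward(np.append(x, target - x[0]))  # reward targets a zero difference
    return ret

def multi_target_objective(policy, dynamics, reward, x0, targets, horizon):
    # A single policy is trained on the summed return over distinct targets,
    # mirroring how the original PILCO handles multiple starting states.
    return sum(rollout_return(policy, dynamics, reward, x0, t, horizon)
               for t in targets)
```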

By the way, a good simple case study for this would be the OpenAI Gym Reacher-v2 environment, where a simple robotic arm has to reach a specific target with its end point, and the target varies from episode to episode. It should be a nice minimal example of the functionality you are looking for.
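For reference, a quick way to see the varying target in that environment (assuming gym with a working MuJoCo installation) is to reset it a few times: the target is resampled every episode and its position is part of the observation, so a policy can condition on it.

```python
import gym

# Reacher-v2 resamples the target position on every reset; the observation
# includes the target coordinates alongside the arm's joint state.
env = gym.make('Reacher-v2')
for episode in range(3):
    obs = env.reset()
    print('episode', episode, 'observation', obs)
env.close()
```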

I am also interested in this and will probably try a few things in the next few weeks. Keep me posted if you make any progress, and I will mention this issue in any relevant commits. Good luck and have fun!

ManuelM95 commented 4 years ago

Hi @kyr-pol , many thanks for your detailed answer; I was getting nervous because of the lack of progress ;). I will discuss your input with my tutor and check with him how we plan to proceed. I will keep you posted.

Thanks, Manuel

ManuelM95 commented 4 years ago

Hey @kyr-pol , I spoke with my tutor, and since my deadline is in 2 months and I also need to write the semester thesis, I won't be able to implement those changes :(. Sorry for that, and good luck with the project.

Thanks for the help, Manuel

kyr-pol commented 4 years ago

Ok, no problem, good luck with the thesis!