hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

question: multiple reward array? #841

Open greg2paris opened 4 years ago

greg2paris commented 4 years ago

Is it possible to use an array of different rewards? For example, if we have a robot that has to walk and at the same time look around with its camera, we could separate the rewards into an array to tell the agent that it is doing a good job at walking but not at looking around, so the learning process would be faster than with only a single reward. Another example would be trading multiple assets at the same time: if the agent acts completely at random, it is very unlikely that its balance will be positive, and it would take a lot of time to train. But if the agent knows that one of the trades it made was positive, it may learn to trade better and faster. Is this possible, or should I split my environment into multiple "little" environments?

Miffyli commented 4 years ago

This is outside the scope of stable-baselines, and the library does not support such a setup. The closest approach is to combine the multiple rewards into one (e.g. a sum or weighted sum) and train on that. You might want to look up terms like "multi-task learning" to learn more about this.
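
As a concrete illustration of that workaround, here is a minimal sketch of a gym wrapper that collapses a reward vector into a weighted scalar before it reaches the algorithm. The wrapper name, the weights, and the assumption that your environment's `step()` already returns a vector of reward components are all illustrative, not part of the library:

```python
import gym
import numpy as np


class WeightedSumReward(gym.Wrapper):
    """Collapse a vector of reward components into the single scalar
    reward that stable-baselines expects."""

    def __init__(self, env, weights):
        super(WeightedSumReward, self).__init__(env)
        self.weights = np.asarray(weights, dtype=np.float64)

    def step(self, action):
        # Assumes the wrapped env returns a vector of reward components.
        obs, reward_vector, done, info = self.env.step(action)
        # Keep the individual components around for logging/debugging.
        info["reward_components"] = np.asarray(reward_vector)
        scalar_reward = float(np.dot(self.weights, reward_vector))
        return obs, scalar_reward, done, info
```

The wrapped environment can then be used with any stable-baselines algorithm as usual, e.g. something like `PPO2("MlpPolicy", WeightedSumReward(my_env, weights=[1.0, 0.5]))`, where `my_env` is your multi-reward environment.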

You may close this issue if there are no other questions related to stable-baselines.

Mohamedgalil commented 4 years ago

Nice idea for structuring the rewards. A similar idea has actually already been published in Hybrid Reward Architecture for Reinforcement Learning, with which the authors were the first to solve Ms. Pac-Man. We will present similar work on safe human-robot collaboration using DRL at ICRA 2020 (where we define one reward for reaching the goal and one for avoiding obstacles). In order to extend stable_baselines to support this, you need to:

  1. Modify the reward function in your environment to return a vector instead of a scalar (see the sketch after this list).
  2. If you are using an actor-critic architecture, you only need to modify the critic so that its Q-value output has the same dimensions as the reward (similar to how the input dimensions of the critic and the actor are defined by the dimensions of the observation and action spaces).
  3. If you are using a replay buffer (off-policy), you need to adjust the size of the stored reward there as well.

We applied this successfully to DDPG, HER+DDPG and PPO. In our case, HRA did not bring large performance gains, but it did show a small improvement in sample efficiency and in the stability of the learnt behaviour.
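
A minimal sketch of step 1, assuming a made-up 2D reach-and-avoid task (the class name `ReachAndAvoidEnv` and the two reward components are illustrative, not part of stable-baselines); steps 2 and 3 require editing the library's critic and replay-buffer code and are not shown here:

```python
import gym
import numpy as np
from gym import spaces


class ReachAndAvoidEnv(gym.Env):
    """Toy environment whose step() returns a reward *vector*:
    one component for reaching the goal, one for avoiding an obstacle."""

    def __init__(self):
        super(ReachAndAvoidEnv, self).__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self._goal = np.array([0.8, 0.8], dtype=np.float32)
        self._obstacle = np.array([0.0, 0.0], dtype=np.float32)
        self._pos = None

    def reset(self):
        self._pos = np.array([-0.8, -0.8], dtype=np.float32)
        return self._get_obs()

    def _get_obs(self):
        return np.concatenate([self._pos, self._goal - self._pos]).astype(np.float32)

    def step(self, action):
        self._pos = np.clip(self._pos + 0.05 * np.asarray(action, dtype=np.float32), -1.0, 1.0)
        # Reward component 0: negative distance to the goal (reaching).
        reach_reward = -float(np.linalg.norm(self._pos - self._goal))
        # Reward component 1: penalty when too close to the obstacle (avoiding).
        avoid_reward = -1.0 if np.linalg.norm(self._pos - self._obstacle) < 0.2 else 0.0
        reward_vector = np.array([reach_reward, avoid_reward], dtype=np.float32)
        done = bool(np.linalg.norm(self._pos - self._goal) < 0.05)
        return self._get_obs(), reward_vector, done, {}
```

Note that stock stable-baselines will not accept a non-scalar reward, so an environment like this only works together with the critic and replay-buffer changes described above, or after scalarizing the vector (e.g. with a wrapper like the one sketched earlier in this thread).
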
madhekar commented 1 year ago

Is there any progress on this in stable-baselines3, i.e. handling a reward vector?