openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Customize reward function with DDPG+HER #720

Open WuXinyang2012 opened 5 years ago

WuXinyang2012 commented 5 years ago

Hi guys, I am playing with DDPG+HER on the FetchReach-v1 environment.

I want to train an agent that reaches the desired goal while avoiding collisions.

I added a geometric obstacle to reach.xml and implemented a collision detector by monitoring the contact forces. I also modified the reward function defined in https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py#L53 so that it gives a -10 reward every time a collision is detected.
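
Concretely, my change looks roughly like this (a simplified sketch; the `is_collision` flag in `info` is set by my own contact-force check and does not exist in the stock environment):

```python
import numpy as np

def goal_distance(goal_a, goal_b):
    # Same distance helper used by gym/envs/robotics/fetch_env.py.
    return np.linalg.norm(goal_a - goal_b, axis=-1)

def compute_reward(self, achieved_goal, goal, info):
    # Original sparse reward: 0 when within distance_threshold of the goal, -1 otherwise.
    d = goal_distance(achieved_goal, goal)
    reward = -(d > self.distance_threshold).astype(np.float32)
    # My addition: a -10 penalty whenever my contact-force detector reports a collision.
    if info.get('is_collision', False):
        reward = reward - 10.0
    return reward
```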

However, the problem is that when HER prepares the replay buffer, it substitutes the desired goal with the achieved goal and recomputes the reward. During this relabeling the collision information is lost, and the -10 reward is lost as well, since the collision point is now treated as the new desired goal. As a result, the agent still only learns how to reach a given goal, without taking the collision into account.
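
In other words, the relabeling step does roughly this (a simplified sketch of what I understand baselines' HER sampler to be doing, not the exact code):

```python
def her_relabel(transition, episode_achieved_goals, future_t, compute_reward):
    """Simplified sketch of HER 'future' goal relabeling (not the exact baselines code).

    The stored desired goal is replaced by an achieved goal from a later step of the
    same episode, and the reward is recomputed from scratch with the environment's
    compute_reward -- whatever reward was originally stored for this transition
    (including my -10 collision penalty) is simply discarded.
    """
    relabeled = dict(transition)
    relabeled['desired_goal'] = episode_achieved_goals[future_t]
    relabeled['reward'] = compute_reward(
        relabeled['achieved_goal'],   # what the gripper actually reached
        relabeled['desired_goal'],    # the substituted goal
        relabeled.get('info', {}),
    )
    return relabeled
```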

Can anyone give me some hints on how to modify the HER algorithm so that it works with my customized rewards and keeps the -10 rewards on collision trajectories?

pzhokhov commented 5 years ago

I am not sure I understand why HER would forget about collisions... If the negative reward due to a collision is independent of the desired and achieved goals, then whenever a transition that leads to a collision is sampled from the buffer, it should return that negative collision reward, so in principle the agent should learn to avoid it. However, I can see how this could happen if the positive reward for reaching the goal is larger than the collision penalty: in that case, for transitions in which the desired goal is replaced with the achieved goal, the agent would be happy even if it bumped into an obstacle. So one (admittedly very naive) thing to try is to increase the absolute value of the collision penalty.

Another way I can think of (though it achieves a slightly different result) is to end the episode when a collision is detected (and give a reward of -10 at the collision instead of the positive reward for reaching the goal).
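
A minimal sketch of that idea as a wrapper (the `is_collision` key in `info` is whatever your contact-force detector sets; it is not part of the stock environment):

```python
import gym

class TerminateOnCollision(gym.Wrapper):
    """End the episode with a -10 reward as soon as a collision is detected.

    Assumes the wrapped env puts a boolean under info['is_collision'],
    set by your own contact-force check.
    """

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if info.get('is_collision', False):
            reward = -10.0
            done = True
        return obs, reward, done, info
```

You would wrap your environment with this before handing it to the training code.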

Putting @mandrychowicz @fjwolski @machinaut @wojzaremba into the loop on this as authors of HER for more principled advice.

WuXinyang2012 commented 5 years ago

Some Updates:

Finally, it works now. The best performance is reached with a collision penalty of -5.

In addition to the negative reward, I also added the position of the obstacle to the observations to make the agent aware of it. Now it works perfectly with a static obstacle + dynamic goal: the agent plans the shortest path towards the goal while avoiding any collision with the obstacle.
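
Roughly, what I mean by adding the obstacle to the observations (a sketch; `get_obstacle_pos` is a placeholder for however you read the obstacle's position out of the simulator, e.g. from the MuJoCo body I added in reach.xml):

```python
import numpy as np
import gym

class ObstacleInObs(gym.ObservationWrapper):
    """Append the obstacle position to the 'observation' vector of a goal-based env."""

    def __init__(self, env, get_obstacle_pos):
        super().__init__(env)
        # Callable that returns the obstacle position for the current sim state.
        self._get_obstacle_pos = get_obstacle_pos

    def observation(self, obs):
        obstacle_pos = np.asarray(self._get_obstacle_pos(self.env), dtype=np.float32)
        obs = dict(obs)
        obs['observation'] = np.concatenate([obs['observation'], obstacle_pos])
        return obs
```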

However, with a dynamic obstacle + dynamic goal, collisions still occur... Can anyone give me some hints?

Since adding the obstacle to the observations makes the agent work perfectly with a static obstacle, I would say the problem with the dynamic obstacle is a lack of experience, or underfitting. So should I increase the number of network layers and n_cycles during training?

I am also thinking of building a similar HER-style relabeling, but for the obstacle, to make the agent learn much more about the dynamic obstacle. Does anyone have any interesting suggestions?

P.S.: I also tried shaped rewards, but they did not work well.

poliandre98 commented 3 years ago

Hi @WuXinyang2012, I have read that you successfully trained the agent to avoid collisions. I'm new to reinforcement learning and I'm trying to train my own reach scene that also handles collisions. I have tried to run the training with the HER algorithm, but it takes too long (roughly 20 minutes per epoch, both with --num_timesteps=1e6 and with --num_timesteps=300k). Could you share some advice on how to run the training? The values in experiment/config.py would also be useful to me, to understand how to set up a good training run. Thank you in advance. Best regards, Andrea.
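
For reference, these are the kinds of values I mean (copied by hand from baselines/her/experiment/config.py in my checkout; the exact names and defaults may differ between versions). I am launching training with something like `python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=1e6`.

```python
# Excerpt of the HER settings I would like advice on, from
# baselines/her/experiment/config.py (names/defaults may differ between versions).
DEFAULT_PARAMS = {
    'layers': 3,                   # number of hidden layers in the actor/critic networks
    'hidden': 256,                 # units per hidden layer
    'n_cycles': 50,                # policy-update cycles per epoch
    'n_batches': 40,               # training batches per cycle
    'batch_size': 256,             # transitions per training batch
    'buffer_size': int(1e6),       # replay buffer size (transitions)
    'replay_strategy': 'future',   # HER goal-relabeling strategy
    'replay_k': 4,                 # relabeled goals per real transition
    # ... other entries omitted
}
```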