stanfordnmbl / osim-rl

Reinforcement learning environments with musculoskeletal models
http://osim-rl.stanford.edu/
MIT License

Is this reward function good for competition evaluation? #201

Open luckeciano opened 4 years ago

luckeciano commented 4 years ago

Hey guys,

I would like to raise a concern regarding the reward function.

After some analysis, I think it can easily be exploited by controllers that do not walk. Basically, the positive reward comes from the alive bonus and from footstep duration. An agent can perform footsteps with no pelvis velocity (holding its initial position), or even hold a single long footstep from the beginning of the episode until the end without ever moving. In that case the penalization is very low: the effort is low, and there is no penalization from velocity deviation because v_tgt is a null vector at the initial position.
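
To make the concern concrete, here is a minimal sketch of the reward structure I am describing. The weights, function name, and exact form of each term are illustrative assumptions, not the actual osim-rl implementation:

```python
import numpy as np

# Illustrative weights -- assumptions for the sketch, not the competition values.
W_FOOTSTEP = 10.0  # reward per second of footstep duration
W_EFFORT = 1.0     # penalty on integrated muscle effort
W_VEL_DEV = 3.0    # penalty on deviation from the target velocity field

def footstep_reward(del_t, effort, v_pelvis, v_tgt):
    """Reward credited when a footstep is completed (simplified sketch).

    del_t    -- duration of the footstep in seconds
    effort   -- integrated squared muscle activations over the footstep
    v_pelvis -- mean pelvis velocity over the footstep (2D)
    v_tgt    -- target velocity from the navigation field (2D)
    """
    duration_term = W_FOOTSTEP * del_t
    effort_penalty = W_EFFORT * effort
    vel_dev_penalty = W_VEL_DEV * del_t * np.linalg.norm(np.subtract(v_pelvis, v_tgt))
    return duration_term - effort_penalty - vel_dev_penalty

# One long, nearly motionless "footstep": effort stays tiny, so the duration
# term dominates even though the pelvis never tracks v_tgt.
print(footstep_reward(del_t=5.0, effort=0.5, v_pelvis=[0.0, 0.0], v_tgt=[1.4, 0.0]))
```

With numbers like these, the duration term alone outweighs the penalties, which is exactly the exploit I am worried about.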

As the objective of the competition is to learn to walk effectively while following the navigation field, I think the reward function should be modified. My first thought is to add another term that explicitly rewards moving. What do you think?

smsong commented 4 years ago

@luckeciano Could you elaborate on v_tgt being null at the initial position? How did you get this null vector?

luckeciano commented 4 years ago

Hey @smsong,

Actually, I made a mistake: v_tgt is not null at the initial position (I only saw a point on the map, but there is an arrow as well). I'm sorry.

However, I printed the components of the footstep reward, and in this situation the penalization is very low compared with the total reward from just taking one long footstep over the whole episode. In one of my tests, my agent performed a single footstep and obtained a reward of 47, losing only ~10 to effort and velocity deviation.
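
Roughly, the check I ran looks like the sketch below: run a policy, then compare the accumulated reward with the net pelvis displacement. The environment class and state fields follow the osim-rl conventions as far as I know, but treat the details as assumptions; the zero-excitation action is only a placeholder for the trained single-footstep controller.

```python
from osim.env import L2M2019Env

env = L2M2019Env(visualize=False)
obs = env.reset()
start_x = env.get_state_desc()['body_pos']['pelvis'][0]

total_reward, done = 0.0, False
while not done:
    # Placeholder policy: replace with the trained controller being inspected.
    action = [0.0] * env.get_action_space_size()
    obs, reward, done, info = env.step(action)
    total_reward += reward

end_x = env.get_state_desc()['body_pos']['pelvis'][0]
print("total reward:", total_reward, "pelvis displacement:", end_x - start_x)
```

A high total reward paired with near-zero displacement is the symptom I am describing.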

Therefore, it is possible to obtain almost all of the available reward without leaving the initial position. I think the reward should be modified, at least the weights; otherwise, top submissions might contain no walking motion at all.

smsong commented 4 years ago

@luckeciano Thanks for the clarification and suggestion. However, if a network exploits the single-footstep solution you mention, it will probably be stuck in a local minimum and will not be able to compete with good solutions. It is also possible that some participants have already worked around this issue by using different rewards to first train a good network and then fine-tuning on the given reward, so it may be unfair to change the reward at this point. A systematic investigation of rewards that facilitate training could be an interesting study ;)
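
For anyone exploring that two-phase route, a shaping wrapper is a minimal sketch of the idea. The pelvis-velocity bonus, its weight, and the state fields used are illustrative assumptions and are not part of the official competition reward:

```python
from osim.env import L2M2019Env

class ShapedL2M2019Env(L2M2019Env):
    """L2M2019Env with an extra shaping bonus during the first training phase."""

    VEL_BONUS_WEIGHT = 2.0  # assumed shaping weight; set to 0 when fine-tuning

    def step(self, action, **kwargs):
        obs, reward, done, info = super().step(action, **kwargs)
        # Bonus proportional to forward pelvis velocity, meant to push the agent
        # out of the stationary local minimum; removed for the final fine-tuning.
        pelvis_vel_x = self.get_state_desc()['body_vel']['pelvis'][0]
        return obs, reward + self.VEL_BONUS_WEIGHT * pelvis_vel_x, done, info
```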