openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

[Question] Documentation for lunar lander rewards incomplete #3014

Closed: ToonTalk closed this issue 2 years ago

ToonTalk commented 2 years ago

https://www.gymlibrary.ml/environments/box2d/lunar_lander/#rewards states

Reward for moving from the top of the screen to the landing pad and coming to rest is about 100-140 points.

This is very vague. Is the reward incrementally awarded or only after landing? What determines whether it is 100, 140, or in between?

pseudo-rnd-thoughts commented 2 years ago

The full documentation is

Reward for moving from the top of the screen to the landing pad and coming to rest is about 100-140 points. If the lander moves away from the landing pad, it loses reward. If the lander crashes, it receives an additional -100 points. If it comes to rest, it receives an additional +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points.

I agree that the documentation is not particularly clear, but my understanding is as follows: the robot has (I believe) four legs, it receives 10 points per leg with ground contact, and 100 points for coming to rest. Therefore, the total reward for landing is between 100 and 140.
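Taken literally, that reading derives the range entirely from the leg bonus, roughly like this (a minimal arithmetic sketch of the interpretation above; the four-leg count is the assumption stated there, and the environment's observation actually tracks only two leg contacts):

```python
# Sketch of the "100..140" reading: 100 for coming to rest plus 10 per leg
# in ground contact (the four-leg count is the assumption stated above).
rest_bonus = 100
points_per_leg = 10
for legs_in_contact in range(5):  # 0 through 4 legs
    print(legs_in_contact, rest_bonus + points_per_leg * legs_in_contact)
# -> 0 100, 1 110, 2 120, 3 130, 4 140
```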

RedTachyon commented 2 years ago

The documentation is somewhat incomplete, but the reward function itself is also a bit complex to put into words. I recommend checking out the source code to see how the reward is computed (it depends on the position, velocity, angle, the contact of the two legs, the energy usage, and the completion of the objective). If you or someone else can convert it into a nice natural-language description, that would be great.
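For reference, the per-step logic in gym/envs/box2d/lunar_lander.py looks roughly like the sketch below (paraphrased and simplified; the function wrapper and variable names are mine, not part of the environment's API):

```python
import numpy as np

def lunar_lander_step_reward(state, prev_shaping, m_power, s_power,
                             crashed, at_rest):
    """Sketch of the per-step reward, paraphrased from lunar_lander.py.

    state = [x, y, vx, vy, angle, angular_velocity, leg1_contact, leg2_contact]
    m_power / s_power are the main / side engine throttles used this frame.
    """
    # Potential-style "shaping" score: being close to the pad, slow, upright,
    # and having legs in contact with the ground is rewarded.
    shaping = (
        -100 * np.sqrt(state[0] ** 2 + state[1] ** 2)   # distance from the pad
        - 100 * np.sqrt(state[2] ** 2 + state[3] ** 2)  # speed
        - 100 * abs(state[4])                            # tilt angle
        + 10 * state[6]                                  # left leg contact
        + 10 * state[7]                                  # right leg contact
    )

    # The step reward is the change in shaping since the previous step...
    reward = 0.0 if prev_shaping is None else shaping - prev_shaping

    # ...minus the fuel cost of whatever engines fired this frame.
    reward -= m_power * 0.30
    reward -= s_power * 0.03

    # On termination, the reward is overwritten with a flat value.
    if crashed:
        reward = -100.0
    elif at_rest:
        reward = +100.0

    return reward, shaping
```

Because the shaping terms are computed as a delta from the previous step, they largely telescope over an episode: most of the final score comes from the change in shaping between start and end, the accumulated fuel penalties, and the terminal ±100.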

vairodp commented 2 years ago

I analyzed how the reward is calculated, and I can safely say that the "100..140" figure is inaccurate and misleading.

The reward takes into account, at every step:

- the lander's distance from the landing pad
- its speed
- its angle
- contact of each leg with the ground
- firing of the main and side engines

(Also, at the end of the episode, ±100 is applied, as accurately described.)

I tried to dissect how the reward is calculated in the code, and I couldn't find any evidence of a "100..140" range.
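As far as I can tell, the only constants near 100 are in the terminal branch, which replaces the step reward with a flat value; no 100..140 constant appears anywhere. A paraphrased sketch of that branch (the names here are illustrative, not the source's):

```python
def terminal_reward(step_reward, game_over, off_screen, at_rest):
    # Paraphrased from lunar_lander.py: on termination the reward becomes a
    # flat +/-100; nothing in this branch produces a 100..140 range.
    if game_over or off_screen:
        return -100.0
    if at_rest:
        return +100.0
    return step_reward
```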

To fix this inaccuracy in the documentation, I tried to rewrite the doc in a more "complete" way and have already opened a PR here