openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Is MountainCar-v0 harder? #1294

Closed samar1tan closed 5 years ago

samar1tan commented 5 years ago

I've implemented the Cross-Entropy method, and it solves CartPole-v0 and MountainCarContinuous-v0 with the same hyperparameters and gym's default reward definitions. But it doesn't work on MountainCar-v0, even with the unwrapped version and many more training episodes.
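For reference, a minimal Cross-Entropy-method sketch against the classic gym API (not the poster's code; the linear policy, batch size, elite fraction, and sigma are illustrative assumptions):

```python
import gym
import numpy as np

def run_episode(env, w, b, max_steps=200):
    """Roll out one episode with a linear policy obs @ w + b (classic 4-tuple step API)."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = int(np.argmax(obs @ w + b))       # greedy discrete action
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total

def cem(env_name="CartPole-v0", iters=50, batch=50, elite_frac=0.2, sigma=0.5):
    env = gym.make(env_name)
    n_obs, n_act = env.observation_space.shape[0], env.action_space.n
    mu = np.zeros(n_obs * n_act + n_act)           # mean of the parameter distribution
    for it in range(iters):
        thetas = mu + sigma * np.random.randn(batch, mu.size)
        returns = []
        for th in thetas:
            w, b = th[:n_obs * n_act].reshape(n_obs, n_act), th[n_obs * n_act:]
            returns.append(run_episode(env, w, b))
        elite = thetas[np.argsort(returns)[-int(batch * elite_frac):]]
        mu = elite.mean(axis=0)                    # refit the mean to the elite episodes
        print(f"iter {it}: mean return {np.mean(returns):.1f}")
    return mu
```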

shuruiz commented 5 years ago

If you mean the problem itself, I would say the continuous version is more challenging, because its action space is continuous and continuous control is generally more complex than discrete control. If you mean the difficulty of actually solving the problem in code, I would say MountainCar-v0 is harder: in the gym environment your only action choices are push left / no push / push right, and it is extremely hard to find an action sequence that reaches the goal because you cannot control the acceleration directly. In the continuous version you can adjust your speed much more easily.
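For a quick comparison of the two action spaces (standard gym API; the exact reprs may vary by version):

```python
import gym

print(gym.make("MountainCar-v0").action_space)            # Discrete(3): push left / no push / push right
print(gym.make("MountainCarContinuous-v0").action_space)  # Box(-1.0, 1.0, (1,)): a continuous force
```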

Also, if you train a DQN model (the algorithm DeepMind used for the Atari benchmarks), my personal experiments show it takes much, much longer than traditional RL methods like SARSA(λ). The explanation I can come up with is that deep neural nets capture continuous variables easily but struggle with discrete action sequences, because we usually do not sample a long future trajectory due to computational cost and training time; for example, we usually use only a one-step return when calculating the discounted reward.

To overcome this issue, I would suggest using a longer trajectory when calculating your return (i.e. increase the n of your n-step return) to lower the bias, even though it increases the variance.
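A minimal sketch of the n-step return being suggested (variable names and gamma are illustrative, not from the thread):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """Sum n discounted rewards from step t, then bootstrap from the value estimate n steps ahead."""
    G = sum((gamma ** k) * rewards[t + k] for k in range(n))
    return G + (gamma ** n) * values[t + n]   # bootstrap n steps ahead instead of after one step
```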

Hope this helps.

samar1tan commented 5 years ago

@shuruiz Thanks!!! I've discovered that early training episodes are always terminated by the MAX_STEP_IN_EPISODE limit I set rather than by the DONE signal. Maybe a longer trajectory and some patience are necessary.
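One way to see this (a rough sketch assuming the classic 4-tuple gym step API; the random policy and the MAX_STEP_IN_EPISODE value are placeholders) is to check whether the car actually reached the flag at position 0.5 or simply ran out of steps:

```python
import gym

env = gym.make("MountainCar-v0").unwrapped   # drop the built-in 200-step TimeLimit
obs = env.reset()
done, steps, MAX_STEP_IN_EPISODE = False, 0, 5000
while not done and steps < MAX_STEP_IN_EPISODE:
    obs, reward, done, info = env.step(env.action_space.sample())
    steps += 1
# obs[0] is the car's position; 0.5 is the flag in MountainCar
print(f"steps={steps}, done={done}, reached_goal={obs[0] >= 0.5}")
```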

sven1977 commented 4 years ago

Yes, setting MAX_STEP_IN_EPISODE to 5000 or so helps tremendously. You have to consider this env's terrible reward function: it's -1 per step, so with a 200 time-step limit (the default) your algorithm will think the arbitrary actions it picked right before the 200-step threshold were actually good ones, because the reward then "increases" to 0.0 (everything after the terminal step is treated as r=0.0 in all look-ahead value calculations/estimations/bootstrapping). Also, n-step learning helps tremendously here, as it does in all envs that only reward reaching the goal.
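A small sketch of that pitfall and the usual fix (function and flag names are illustrative, assuming a one-step TD target): when an episode is merely cut off by the time limit, keep bootstrapping instead of treating it as a true terminal.

```python
def td_target(reward, next_value, done, truncated_by_time_limit, gamma=0.99):
    # Treating a time-limit cutoff as a real terminal makes the last arbitrary
    # actions look good, because the stream of -1 rewards suddenly "ends" at 0.
    if done and not truncated_by_time_limit:
        return reward                       # true terminal: no future reward
    return reward + gamma * next_value      # cutoff or mid-episode: bootstrap on
```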