ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.92k stars 5.77k forks source link

[rllib] Same environment performs differently when used as customized env #7691

Closed rl-2 closed 4 years ago

rl-2 commented 4 years ago

What is the problem?

Ray version: 0.7.5 Gym version: 0.17.0 Python version: 3.6.10 TensorFlow version: 1.10.0 OS: macOS Mojave 10.14.6

I used MountainCarContinuous-v0 as a customized environment but got different results than using it as an internal environment. The environment was not modified at all and the hyper-parameters are taken from mountaincarcontinuous-ddpg.yaml in the tuned examples. The policy can still be trained. The detailed results are below:

Use as internal environment:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/12 CPUs, 0/0 GPUs, 0.0/3.76 GiB heap, 0.0/1.27 GiB objects
Memory usage on this node: 10.1/16.0 GiB
Result logdir: /Users/luor/ray_results/mountaincarcontinuous-ddpg
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - DDPG_MountainCarContinuous-v0_0: RUNNING, [1 CPUs, 0 GPUs], [pid=48031], 1 s, 1 iter, 1011 ts, -32.2 rew

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/12 CPUs, 0/0 GPUs, 0.0/3.76 GiB heap, 0.0/1.27 GiB objects
Memory usage on this node: 10.1/16.0 GiB
Result logdir: /Users/luor/ray_results/mountaincarcontinuous-ddpg
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - DDPG_MountainCarContinuous-v0_0: RUNNING, [1 CPUs, 0 GPUs], [pid=48031], 6 s, 3 iter, 3015 ts, 10.6 rew

Use as customized environment:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/12 CPUs, 0/0 GPUs, 0.0/3.76 GiB heap, 0.0/1.27 GiB objects
Memory usage on this node: 10.1/16.0 GiB
Result logdir: /Users/luor/ray_results/mountaincarcontinuous-ddpg
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - DDPG_MountainCar_0:  RUNNING, [1 CPUs, 0 GPUs], [pid=48159], 1 s, 1 iter, 1005 ts, nan rew

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/12 CPUs, 0/0 GPUs, 0.0/3.76 GiB heap, 0.0/1.27 GiB objects
Memory usage on this node: 10.0/16.0 GiB
Result logdir: /Users/luor/ray_results/mountaincarcontinuous-ddpg
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - DDPG_MountainCar_0:  RUNNING, [1 CPUs, 0 GPUs], [pid=48159], 6 s, 3 iter, 3009 ts, 66.4 rew
ericl commented 4 years ago

Maybe you didn't set TimeLimit around the env? gym.make() adds a lot of these wrappers automatically that you don't get with a custom class.

Quick check would be to use gym.make() inside your custom env to create the MountainCar inner env, which should then behave identically.

rl-2 commented 4 years ago

Thanks, @ericl ! That's exactly the problem.