ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Pendulum doesn't appear to be learning #1551

Closed eugenevinitsky closed 6 years ago

eugenevinitsky commented 6 years ago

Describe the problem

Pendulum doesn't appear to be learning. The following script doesn't show any improvement after 100 iterations.

Source code / logs

import ray
import ray.rllib.ppo as ppo

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["use_gae"] = False
config["clip_param"] = 0.2
config["sgd_stepsize"] = 5e-5
config["timesteps_per_batch"] = 10000
agent = ppo.PPOAgent(config=config, env="Pendulum-v0")

for i in range(1000):
    result = agent.train()
    print("result: {}".format(result))

    if i % 100 == 0:
        checkpoint = agent.save()
        print("checkpoint saved at", checkpoint)
eugenevinitsky commented 6 years ago

Cartpole does work under similar configs, so maybe it's just a tuning issue?

ericl commented 6 years ago

I can confirm Pendulum doesn't seem to train. This is actually true for A3C, ES, and plain PG as well, which may indicate there's a common issue, perhaps in the action distribution or the default network architecture parameters.

Other envs seem to work fine though (e.g. Humanoid, Cartpole, Pong). I don't think we've ever tested on Pendulum-v0 so this may have always been an issue.

ericl commented 6 years ago

@richardliaw found some hyperparams that worked: ./train.py --env=Pendulum-v0 --run=PPO --config='{"timesteps_per_batch": 2048, "lambda": 0.1, "gamma": 0.95, "sgd_stepsize": 0.0003, "sgd_batchsize": 64, "num_sgd_iter": 10, "model": {"fcnet_hiddens": [64, 64]}, "min_steps_per_task": 100}'
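The same settings can be written as a config-dict update in the style of the script at the top of the thread (the variable name below is illustrative; the keys mirror the CLI `--config` flags exactly):

```python
# Pendulum hyperparameters from the CLI command above, as a dict
# that could be merged into the script's PPO config.
pendulum_overrides = {
    "timesteps_per_batch": 2048,
    "lambda": 0.1,    # GAE lambda: biases advantage estimates toward short horizons
    "gamma": 0.95,    # shorter discount horizon than the usual 0.99 default
    "sgd_stepsize": 0.0003,
    "sgd_batchsize": 64,
    "num_sgd_iter": 10,
    "model": {"fcnet_hiddens": [64, 64]},  # smaller network
    "min_steps_per_task": 100,
}

# In the original script this would replace the manual assignments:
#   config = ppo.DEFAULT_CONFIG.copy()
#   config.update(pendulum_overrides)
```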

I suspect the discount in particular might make a big difference here. Learning curve:

[learning curve plot attached in the original issue]
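One back-of-the-envelope way to see why the discount matters (my own sketch, not from the thread): the effective planning horizon of a discounted return is roughly 1/(1 - gamma) steps, so dropping gamma from a typical 0.99 to 0.95 shrinks the horizon from about 100 steps to about 20, which better matches Pendulum's short swing-up dynamics:

```python
def effective_horizon(gamma: float) -> float:
    """Approximate number of steps a discounted return 'sees',
    i.e. the sum of the geometric series gamma^t = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

print(effective_horizon(0.99))  # ~100 steps
print(effective_horizon(0.95))  # ~20 steps
```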

eugenevinitsky commented 6 years ago

Thanks @richardliaw! We were concerned because some of our more complicated single-agent experiments, which trained fine in other reinforcement learning libraries, did not seem to be learning here (though we were using TRPO for those). So when Pendulum wasn't working either, we worried there might be a more fundamental problem.