ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

DQN Cartpole not learning #8173

Closed. regproj closed this issue 3 years ago

regproj commented 4 years ago

What is the problem?

I just wanted to run some basic tests for DQN on the CartPole environment to check things over before running it on my own environment. It doesn't seem to be learning, and I'm wondering if I somehow set up the parameters wrong.

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.4, TensorFlow 1.14.0, Python 3.6, Ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
from gym import spaces
import numpy as np
import sys
import os
from gym.envs.registration import register

import ray
import ray.rllib.agents.dqn as dqn 
from ray.tune.logger import pretty_print
from ray.rllib.utils.framework import try_import_tf
from ray.rllib.agents.trainer import with_common_config

ray.init()

NUM_WORKERS = 13
trainer_config = dqn.DEFAULT_CONFIG.copy()
trainer_config["n_step"] = 3
trainer_config["noisy"] = True
trainer_config["num_atoms"] = 51
trainer_config["v_min"] = -1e4
trainer_config["v_max"] = 1e4
trainer_config["buffer_size"] = 2000
trainer_config["target_network_update_freq"] = 10
trainer_config["hiddens"] = [256,128,64]
trainer_config["learning_starts"] = 0
trainer_config["train_batch_size"] = 128
trainer_config['num_workers'] = NUM_WORKERS
trainer_config["timesteps_per_iteration"] = 5*np.clip(NUM_WORKERS,1,9999)
trainer_config["num_gpus"] = 0
trainer_config["num_cpus_per_worker"] = 1
trainer_config["no_done_at_end"] = False
trainer_config["explore"] = True
trainer_config["log_level"] = 'INFO'
trainer_config["clip_actions"] = True
trainer_config["normalize_actions"] = False
trainer_config["sample_async"] = False
trainer_config["evaluation_interval"] = int(1e12)
trainer_config['rollout_fragment_length'] = 1
trainer_config['ignore_worker_failures'] = False
trainer_config["memory_per_worker"] = 0 # memory//(trainer_config['num_workers']+1)
trainer_config["object_store_memory_per_worker"] = 0 # object_store_memory//(trainer_config['num_workers']+1)

MODEL_DEFAULTS = {
    # === Built-in options ===
    # Filter config. List of [out_channels, kernel, stride] for each filter
    "conv_filters": None,
    # Nonlinearity for built-in convnet
    "conv_activation": "relu",
    # Nonlinearity for fully connected net (tanh, relu)
    "fcnet_activation": "relu",
    # Number of hidden layers for fully connected net
    "fcnet_hiddens": [1024,1024,512,512,256],
    # For control envs, documented in ray.rllib.models.Model
    "free_log_std": False,
    # Whether to skip the final linear layer used to resize the hidden layer
    # outputs to size `num_outputs`. If True, then the last hidden layer
    # should already match num_outputs.
    "no_final_linear": False,
    # Whether layers should be shared for the value function.
    "vf_share_layers": True,

    # == LSTM ==
    # Whether to wrap the model with a LSTM
    "use_lstm": False,
    # Max seq len for training the LSTM, defaults to 20
    "max_seq_len": 20,
    # Size of the LSTM cell
    "lstm_cell_size": 256,
    # Whether to feed a_{t-1}, r_{t-1} to LSTM
    "lstm_use_prev_action_reward": False,
    # When using modelv1 models with a modelv2 algorithm, you may have to
    # define the state shape here (e.g., [256, 256]).
    "state_shape": None,

    # == Atari ==
    # Whether to enable framestack for Atari envs
    "framestack": True,
    # Final resized frame dimension
    "dim": 84,
    # (deprecated) Converts ATARI frame to 1 Channel Grayscale image
    "grayscale": False,
    # (deprecated) Changes frame to range from [-1, 1] if true
    "zero_mean": True,

    # === Options for custom models ===
    # Name of a custom model to use
    "custom_model": None,
    # Name of a custom action distribution to use.
    "custom_action_dist": None,

    # Extra options to pass to the custom classes
    "custom_options": {},
    # Custom preprocessors are deprecated. Please use a wrapper class around
    # your environment instead to preprocess observations.
    "custom_preprocessor": None,
}
trainer_config["model"] = MODEL_DEFAULTS

print(trainer_config)

trainer = dqn.DQNTrainer(config=trainer_config, env='CartPole-v0')
save_path = 'gibberish'

if os.path.exists(save_path):
    trainer.restore(save_path)

for i in range(1000):
    print("Training iteration {}...".format(i))
    result = trainer.train()
    print(pretty_print(result))
    checkpoint_path = trainer.save()
    print(checkpoint_path)
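
As a point of comparison for the script above, one sanity check is to train the same DQNTrainer with RLlib's defaults and only watch episode_reward_mean (CartPole-v0 is considered solved at an average return of about 195, with a per-episode cap of 200). The sketch below is not part of the original report and has not been verified on Ray 0.8.4; it reuses the ray.init() call and the dqn import from the script above and only touches config keys that already appear there, with illustrative values rather than a confirmed fix.

# Minimal sketch: near-default DQN on CartPole-v0, used only to check whether
# the customized settings above (small buffer, very frequent target updates,
# v_min/v_max of -1e4/1e4, large fcnet_hiddens) are what prevents learning.
baseline_config = dqn.DEFAULT_CONFIG.copy()
baseline_config["num_workers"] = 0   # single process keeps the check simple
baseline_config["num_gpus"] = 0

baseline = dqn.DQNTrainer(config=baseline_config, env='CartPole-v0')

for i in range(50):
    result = baseline.train()
    mean_reward = result["episode_reward_mean"]
    print("iteration {:3d}: episode_reward_mean = {:.1f}".format(i, mean_reward))
    if mean_reward >= 195:  # CartPole-v0 solve threshold
        print("Default config learns; revisit the customized settings above.")
        break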
regproj commented 4 years ago

[Screenshot: dqn_cartpole_error]

DavidVillero commented 4 years ago

@regproj Have you solved this issue? I am seeing the same problem when I use noisy nets ("noisy"=True). [Screenshot: reward curves] I trained the blue one with e-greedy and the pink one with "noisy"=True. It is worth noting that my configs are exactly the same, except one has "noisy"=True; the relevant settings are below, and a minimal comparison sketch follows the config.

               "eager": True,
                "use_exec_api": False,
                # Model
                "num_atoms": 51,    # Distributional RL
                "v_min": -200,     # Distributional RL
                "v_max": 200,      # Distributional RL
                "noisy": True,      # Noisy Nets
                "sigma0": 0.1,      # Noisy Nets (Rainbow paper says use 0.5 if GPU, 0.1 if CPU
                "dueling": True,    # Dueling Network Architecture
                "double_q": True,   # Double DQN (DDQN)
                "hiddens": [2*51],  # num_actions * num_atoms
                "n_step": 3,        # N-step Q Learning / Multi-step Returns

                # Exploration
                # === Exploration Settings (Experimental) ===
                "exploration_config": {
                    # The Exploration class to use.
                    "type": "EpsilonGreedy",
                    # Config for the Exploration class' constructor:
                    "initial_epsilon": 1.0,
                    "final_epsilon": 0.0, #"exploration_final_eps", in older version
                    "epsilon_timesteps":  200000,  # Timesteps over which to anneal epsilon. "exploration_fraction": in older version

                    # For soft_q, use:
                    # "exploration_config" = {
                    #   "type": "SoftQ"
                    #   "temperature": [float, e.g. 1.0]
                    # }
                },
#                 "schedule_max_timesteps": 2000000,  # 2e6
#                 "exploration_fraction": 0.01,   # Not needed when using Noisy Nets
#                 "exploration_final_eps": 0.0,   # Not needed when using Noisy Nets
                "target_network_update_freq": 8192,  # DQN
#                 "soft_q": False,
#                 "softmax_temp": 1.0,
#                 "parameter_noise": False,  # This is NOT Noisy Nets

                # Replay buffer
                "buffer_size": 500000,  # 5e5  # DQN
                "prioritized_replay": True,            # Prioritized Experience Replay
                "prioritized_replay_alpha": 0.5,       # Prioritized Experience Replay
                "prioritized_replay_beta": 0.4,        # Prioritized Experience Replay
#                 "beta_annealing_fraction": 1.0,        # Prioritized Experience Replay
                "prioritized_replay_beta_annealing_timesteps": 2000000,
                "final_prioritized_replay_beta": 1.0,  # Prioritized Experience Replay
                "prioritized_replay_eps": 1e-6,        # Prioritized Experience Replay
                "compress_observations": True,

                # Optimization
                "gamma": 0.99,
                "lr": 1e-4,
                # "lr_schedule": None,
                "adam_epsilon": 1.5e-4,
#                 "grad_clip": 40,
                "learning_starts": 20000,
                "rollout_fragment_length": 4,
                "train_batch_size": 32,
                "timesteps_per_iteration": 200,

                # Parallelism
                "num_workers": 0,
#                 "optimizer_class": "SyncReplayOptimizer",
#                 "per_worker_exploration": False,
                "worker_side_prioritization": False,
                "min_iter_time_s": 1.,
stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale[bot] commented 3 years ago

Hi again! This issue will now be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen it, or open a new issue, if you would still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!