rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

Unusual MountainCarContinuous Results #97

Closed · xxmissingnoxx closed this issue 4 years ago

xxmissingnoxx commented 4 years ago

Context:

I attempted to use the SAC example (near-identical code below) with the MountainCarContinuous-v0 environment to make sure I had everything configured correctly, thinking this would be a relatively quick, small example to examine that does not require a license.

What's the problem?

The agent does not appear to make any progress towards the reward. The reward in this environment is somewhat sparse: a positive reward is only given when the goal is reached (would HER + SAC be a better candidate, perhaps?), and a penalty is applied for the actions taken along the way. I would have thought SAC would make some progress, given that its objective is based on the maximum-entropy framework. Am I running it correctly? I've included both the code and the viskit-based plot below.
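
For reference, here is a rough sketch of the reward structure as I understand it from the gym source (constants paraphrased, so treat this as an approximation rather than the exact implementation):

# Sketch of the per-step reward in MountainCarContinuous-v0 (paraphrased, approximate).
def sketch_reward(action, reached_goal):
    reward = 100.0 if reached_goal else 0.0   # one-time bonus for reaching the flag
    reward -= 0.1 * action[0] ** 2            # quadratic penalty on the applied force
    return reward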

Aside:

It's been educational (idioms, best practices, etc.) to read through your code, and I appreciate your willingness to share it.

#!/usr/bin/env python3
import gym
import rlkit.torch.pytorch_util as ptu
from rlkit.data_management.env_replay_buffer import EnvReplayBuffer
from rlkit.envs.wrappers import NormalizedBoxEnv
from rlkit.launchers.launcher_util import setup_logger
from rlkit.samplers.data_collector import MdpPathCollector
from rlkit.torch.sac.policies import TanhGaussianPolicy, MakeDeterministic
from rlkit.torch.sac.sac import SACTrainer
from rlkit.torch.networks import FlattenMlp
from rlkit.torch.torch_rl_algorithm import TorchBatchRLAlgorithm

def experiment(variant):
    print("Using GPU?: " + str(ptu.gpu_enabled()))
    expl_env = NormalizedBoxEnv(gym.make("MountainCarContinuous-v0"))
    eval_env = NormalizedBoxEnv(gym.make("MountainCarContinuous-v0"))
    obs_dim = expl_env.observation_space.low.size
    action_dim = eval_env.action_space.low.size

    M = variant['layer_size']
    qf1 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        # First layer nn.Linear(input_size,hidden_sizes[0])
        # 2nd layer nn.Linear(hidden_sizes[0], hidden_sizes[1])
        # Bias and initial Linear weights set with
        #  ptu.fanin_init and .1
        hidden_sizes=[M, M],
    )
    qf2 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[M, M],
    )
    target_qf1 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[M, M],
    )
    target_qf2 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[M, M],
    )
    policy = TanhGaussianPolicy(
        obs_dim=obs_dim,
        action_dim=action_dim,
        hidden_sizes=[M, M],
    )
    eval_policy = MakeDeterministic(policy)
    eval_path_collector = MdpPathCollector(
        eval_env,
        eval_policy,
    )
    expl_path_collector = MdpPathCollector(
        expl_env,
        policy,
    )
    replay_buffer = EnvReplayBuffer(
        variant['replay_buffer_size'],
        expl_env,
    )
    trainer = SACTrainer(
        env=eval_env,
        policy=policy,
        qf1=qf1,
        qf2=qf2,
        target_qf1=target_qf1,
        target_qf2=target_qf2,
        **variant['trainer_kwargs']
    )
    algorithm = TorchBatchRLAlgorithm(
        trainer=trainer,
        exploration_env=expl_env,
        evaluation_env=eval_env,
        exploration_data_collector=expl_path_collector,
        evaluation_data_collector=eval_path_collector,
        replay_buffer=replay_buffer,
        **variant['algorithm_kwargs']
    )
    algorithm.to(ptu.device)
    algorithm.train()

if __name__ == "__main__":
    # noinspection PyTypeChecker
    variant = dict(
        algorithm="SAC",
        version="normal",
        layer_size=256,
        replay_buffer_size=int(1E6),
        algorithm_kwargs=dict(
            num_epochs=3000,
            num_eval_steps_per_epoch=5000,
            num_trains_per_train_loop=1000,
            num_expl_steps_per_train_loop=1000,
            min_num_steps_before_training=1000,
            max_path_length=1000,
            batch_size=256,
        ),
        trainer_kwargs=dict(
            discount=0.99,
            soft_target_tau=5e-3,
            target_update_period=1,
            policy_lr=3E-4,
            qf_lr=3E-4,
            reward_scale=1,
            use_automatic_entropy_tuning=True,
        ),
    )
    setup_logger('example-sac-car', variant=variant, log_dir="./rlkit_out/")
    ptu.set_gpu_mode(True)  # optionally set the GPU (default=False)
    experiment(variant)

[attached: "newplot (1)", the viskit plot referenced above]

vitchyr commented 4 years ago

It looks like the action penalty is dominating the reward. Note that this is a classic exploration problem, and it's a surprisingly hard environment for methods that don't have very deep exploration strategies. That said, I think turning up the target entropy should help with exploration.
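
For concreteness, a minimal sketch of that change, assuming SACTrainer accepts a target_entropy kwarg alongside use_automatic_entropy_tuning (the automatic default is -dim(action_space), i.e. -1 for this environment); the -0.1 below is only an illustrative value that would still need tuning:

    trainer_kwargs=dict(
        discount=0.99,
        soft_target_tau=5e-3,
        target_update_period=1,
        policy_lr=3E-4,
        qf_lr=3E-4,
        reward_scale=1,
        use_automatic_entropy_tuning=True,
        # Override the automatic -dim(A) = -1 target with a higher (less
        # negative) value to encourage more exploration. Illustrative only.
        target_entropy=-0.1,
    ),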

xxmissingnoxx commented 4 years ago

Thank you! I'll tinker with it. I saw that the automatic entropy tuning mentioned in the paper was implemented, and thought I'd give it a shot. In practice, do you recommend a hyperparameter tuning tool or approach beyond grid search or trial-and-error?

vitchyr commented 4 years ago

Mostly just grid search. If I had to tune the target entropy, I would look at the automatic target entropy and increase/decrease it by a factor of 10. Note that for continuous action spaces, this target entropy can be negative.
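
As a concrete (hypothetical) illustration of that grid for this environment, where the automatic target is -dim(A) = -1:

# Hypothetical grid over target_entropy for MountainCarContinuous-v0 (action_dim = 1).
# Scale the automatic -1 default up and down by a factor of 10, one run per value.
base_target_entropy = -1.0
for scale in (0.1, 1.0, 10.0):
    variant['trainer_kwargs']['target_entropy'] = base_target_entropy * scale
    experiment(variant)  # launch one run per setting (or submit separate jobs)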

xxmissingnoxx commented 4 years ago

Thank you!