xxmissingnoxx closed this issue 4 years ago.
It looks like the action penalty is dominating the reward. Note that this is a classic exploration problem, and it's a surprisingly hard environment for methods that don't have very deep exploration strategies. That said, I think that if you try turning up the target entropy, it should help with exploration.
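For context on why the per-step action penalty can dominate: MountainCarContinuous-v0 pays a one-time positive bonus only at the goal and charges a small quadratic cost for every action taken. The sketch below is illustrative only; the constants are taken from the gym source around this version and may differ in later releases.

```python
import math


def mountain_car_continuous_reward(action, reached_goal):
    """Approximate per-step reward of MountainCarContinuous-v0.

    Until the cart reaches the goal, every step only *subtracts* reward,
    so a policy that has never seen the +100 bonus learns that the
    cheapest behaviour is to stop acting -- the exploration trap
    described above.
    """
    reward = 100.0 if reached_goal else 0.0  # sparse goal bonus
    reward -= 0.1 * math.pow(action, 2)      # quadratic action penalty
    return reward


# A full-throttle episode that never reaches the goal accumulates roughly
# 999 * (-0.1 * 1.0 ** 2) = -99.9 reward, i.e. the penalty alone nearly
# cancels out the goal bonus the agent has never observed.
```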
On Sat, Feb 8, 2020, 6:22 AM alex notifications@github.com wrote:
Context:
I attempted to use the SAC example (near-identical code below) to make sure I had the environment configured correctly, using the [MountainCarContinuous-v0](https://github.com/openai/gym/wiki/MountainCarContinuous-v0) environment on the thought that it would be a relatively quick, small example to examine that does not require a license.
What's the problem?
The agent does not appear to make any progress towards the reward. This environment is somewhat sparse in that a positive reward is only given when the goal is reached (would HER + SAC be a better candidate, perhaps?), while a penalty is charged for the actions used. I would have thought that SAC would make some progress, given that its objective function is based on the maximum-entropy framework. Am I running it correctly? I've included both the code and the viskit-based plot.
Aside:
It's been educational (idioms, best practices, etc.) to read through your code, and I appreciate your willingness to share.
```python
#!/usr/bin/env python3
import gym

import rlkit.torch.pytorch_util as ptu
from rlkit.data_management.env_replay_buffer import EnvReplayBuffer
from rlkit.envs.wrappers import NormalizedBoxEnv
from rlkit.launchers.launcher_util import setup_logger
from rlkit.samplers.data_collector import MdpPathCollector
from rlkit.torch.sac.policies import TanhGaussianPolicy, MakeDeterministic
from rlkit.torch.sac.sac import SACTrainer
from rlkit.torch.networks import FlattenMlp
from rlkit.torch.torch_rl_algorithm import TorchBatchRLAlgorithm


def experiment(variant):
    print("Using GPU?: " + str(ptu.gpu_enabled()))
    expl_env = NormalizedBoxEnv(gym.make("MountainCarContinuous-v0"))
    eval_env = NormalizedBoxEnv(gym.make("MountainCarContinuous-v0"))
    obs_dim = expl_env.observation_space.low.size
    action_dim = eval_env.action_space.low.size

    M = variant['layer_size']
    qf1 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        # First layer: nn.Linear(input_size, hidden_sizes[0])
        # 2nd layer: nn.Linear(hidden_sizes[0], hidden_sizes[1])
        # Bias and initial Linear weights set with ptu.fanin_init and .1
        hidden_sizes=[M, M],
    )
    qf2 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[M, M],
    )
    target_qf1 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[M, M],
    )
    target_qf2 = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[M, M],
    )
    policy = TanhGaussianPolicy(
        obs_dim=obs_dim,
        action_dim=action_dim,
        hidden_sizes=[M, M],
    )
    eval_policy = MakeDeterministic(policy)
    eval_path_collector = MdpPathCollector(
        eval_env,
        eval_policy,
    )
    expl_path_collector = MdpPathCollector(
        expl_env,
        policy,
    )
    replay_buffer = EnvReplayBuffer(
        variant['replay_buffer_size'],
        expl_env,
    )
    trainer = SACTrainer(
        env=eval_env,
        policy=policy,
        qf1=qf1,
        qf2=qf2,
        target_qf1=target_qf1,
        target_qf2=target_qf2,
        **variant['trainer_kwargs']
    )
    algorithm = TorchBatchRLAlgorithm(
        trainer=trainer,
        exploration_env=expl_env,
        evaluation_env=eval_env,
        exploration_data_collector=expl_path_collector,
        evaluation_data_collector=eval_path_collector,
        replay_buffer=replay_buffer,
        **variant['algorithm_kwargs']
    )
    algorithm.to(ptu.device)
    algorithm.train()


if __name__ == "__main__":
    # noinspection PyTypeChecker
    variant = dict(
        algorithm="SAC",
        version="normal",
        layer_size=256,
        replay_buffer_size=int(1E6),
        algorithm_kwargs=dict(
            num_epochs=3000,
            num_eval_steps_per_epoch=5000,
            num_trains_per_train_loop=1000,
            num_expl_steps_per_train_loop=1000,
            min_num_steps_before_training=1000,
            max_path_length=1000,
            batch_size=256,
        ),
        trainer_kwargs=dict(
            discount=0.99,
            soft_target_tau=5e-3,
            target_update_period=1,
            policy_lr=3E-4,
            qf_lr=3E-4,
            reward_scale=1,
            use_automatic_entropy_tuning=True,
        ),
    )
    setup_logger('example-sac-car', variant=variant, log_dir="./rlkit_out/")
    ptu.set_gpu_mode(True)  # optionally set the GPU (default=False)
    experiment(variant)
```
![newplot (1)](https://user-images.githubusercontent.com/17442967/74086708-7e4df280-4a53-11ea-8c8e-191b95d25619.png)
Thank you! I'll tinker with it. I saw that the automatic entropy tuning mentioned in the paper is implemented and thought I'd give it a shot. In practice, do you recommend a hyperparameter tuning tool or approach beyond grid search or trial and error?
Mostly just grid search. If I had to tune the target entropy, I would look at the automatic target entropy and increase/decrease it by a factor of 10. Note that for continuous action spaces, this target entropy can be negative.
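As a concrete illustration of that advice: MountainCarContinuous-v0 has a 1-D action space, so the automatically chosen target entropy in rlkit is -dim(action_space) = -1, and a factor-of-10 sweep would cover values like -10 and -0.1 (less negative means a higher entropy target, i.e. more exploration). The sketch below is a minimal example, assuming your rlkit version's SACTrainer accepts a `target_entropy` keyword argument (recent versions do).

```python
import numpy as np

# rlkit's default when use_automatic_entropy_tuning=True and no target_entropy
# is supplied: -prod(action_space.shape) -> -1.0 for MountainCarContinuous-v0.
action_dim = 1
default_target_entropy = -float(np.prod((action_dim,)))

# Factor-of-10 sweep around the default. Less negative values ask the policy
# to keep more entropy, i.e. explore more.
for target_entropy in (default_target_entropy * 10,   # -10.0 (less exploration)
                       default_target_entropy,        # -1.0  (rlkit default)
                       default_target_entropy / 10):  # -0.1  (more exploration)
    trainer_kwargs = dict(
        discount=0.99,
        soft_target_tau=5e-3,
        target_update_period=1,
        policy_lr=3e-4,
        qf_lr=3e-4,
        reward_scale=1,
        use_automatic_entropy_tuning=True,  # alpha is still learned, but
        target_entropy=target_entropy,      # against this manual target
    )
    # ...pass trainer_kwargs into the variant/experiment from the script above
    # for each run of the sweep.
```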
Thank you!