ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Bad performance of LSTM policies with PPO #5278

Closed: alex-petrenko closed this issue 4 years ago

alex-petrenko commented 5 years ago

System information

Problem

LSTM policies can't match the performance of feed-forward policies even on tasks where having state (memory) gives an advantage

[plot: episode reward vs. wall time for the feed-forward (orange) and RNN (blue) agents]

Here the orange agent is feed-forward and the blue one is an RNN. Both policies are trained for the same amount of wall time. In this particular task the agent has to remember the color of the object in the middle of the room and collect objects of the corresponding color to maximize reward. This is not a standard environment, but I've observed a similar performance gap across many task and hyperparameter combinations. This video demonstrates the task: https://github.com/edbeeching/3d_control_deep_rl/blob/master/videos/two_color_example.gif

Here's my current config:

env: doom_two_colors_fixed
run: PPO
config:
    lr: 0.0001
    lambda: 0.95
    kl_coeff: 0.5
    clip_rewards: False
    clip_param: 0.1
    vf_clip_param: 100000.0
    entropy_coeff: 0.0005
    train_batch_size: 6144
    sample_batch_size: 64
    sgd_minibatch_size: 512
    num_sgd_iter: 4
    num_workers: 18
    num_envs_per_worker: 8
    batch_mode: truncate_episodes
    observation_filter: NoFilter
    vf_share_layers: true
    num_gpus: 1
    model:
        custom_model: vizdoom_vision_model
        conv_filters: [
            [32, [8, 8], 4],
            [64, [4, 4], 2],
            [64, [3, 3], 2],
            [128, [3, 3], 2],
        ]
        conv_activation: elu
        fcnet_activation: elu  # was tanh

        use_lstm: True
        max_seq_len: 32
        lstm_cell_size: 256
        lstm_use_prev_action_reward: False
        framestack: False
        grayscale: False
        zero_mean: False

I tried many hyperparameter combinations and none of them led to an improvement: a larger minibatch size, kl_coeff set to 0, different entropy coefficients, different num_sgd_iter values, a larger max_seq_len.

I might be missing something important, so suggestions are very welcome!

ericl commented 5 years ago

Is it possible to provide a reproduction script?

alex-petrenko commented 5 years ago

> Is it possible to provide a reproduction script?

Would it work if I sent you my repository with installation instructions and the script to reproduce this? This example uses a custom environment that requires VizDoom and my Gym wrapper.

I'm not sure if this can be reproduced on any of the standard environments such as Atari because they are all pretty much MDPs (except those we can't solve anyway).

ericl commented 5 years ago

Sure, that would work great!

ericl commented 5 years ago

By the way, one thing I noticed about your config is that vf_share_layers=True, but you don't tune vf_loss_coeff. In some other LSTM tests I've found the VF loss weight has a huge impact on performance, so you might want to reduce it to somewhere in the 1e-2 to 1e-5 range.

This is especially problematic if your rewards are large, since the VF loss is error^2 and can therefore be way out of scale compared to the policy loss.
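
As a concrete illustration (just a sketch; the env and the particular values here are placeholders, not a recommendation), you could sweep it with Tune's grid_search and keep whichever value puts the scaled VF loss on roughly the same order of magnitude as the policy loss:

from ray import tune

# Illustrative sweep only: pick the vf_loss_coeff that keeps
# vf_loss_coeff * vf_loss comparable to the policy loss.
tune.run(
    "PPO",
    stop={"timesteps_total": int(5e6)},
    config={
        "env": "BreakoutNoFrameskip-v4",  # placeholder env
        "vf_share_layers": True,
        "vf_loss_coeff": tune.grid_search([1e-2, 1e-3, 1e-4, 1e-5]),
    },
)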

josjo80 commented 5 years ago

I'm also seeing similar results in the StarCraft 2 SMAC environment. LSTM-based policy networks seem to train much more slowly. I chalked it up to the larger batch sizes that seem to be needed with a large max_seq_len. Unfortunately, I can't make much headway sweeping the hyperparameters because I get a stack overflow during training. I did try reducing vf_loss_coeff to 1e-3, and that seemed to help until the stack overflow hit again. I'm also having trouble using all 4 GPUs: even though I set num_gpus to 4, only 1 is used.

alex-petrenko commented 5 years ago

> Sure, that would work great!

Eric, I'm really sorry; I got hit by a bunch of deadlines and am struggling to find time to look at this again. The one thing I did do was sweep vf_loss_coeff from 1e-4 to 1.0, and it didn't seem to have a huge impact, although I understand the goal of keeping the policy and value losses on roughly the same scale.

ericl commented 5 years ago

Here are some data points for PPO and IMPALA on Breakout:

[plot: PPO and IMPALA reward curves on Breakout, LSTM vs. frame stack, Adam vs. RMSprop]

Overall it seems the LSTM policy can be successful on Breakout, though not quite as quickly as the frame-stacked solution. Interestingly, the choice of optimizer has a huge impact: for IMPALA, Adam flatlines, while for PPO it's RMSprop that doesn't work. I don't think it's surprising that an LSTM policy takes longer to learn, but it does seem to be much more brittle with respect to hyperparameters.

The full hyperparameters:

atari-impala:
    env: BreakoutNoFrameskip-v4
    run: IMPALA
    config:
        sample_batch_size: 50
        train_batch_size: 500
        num_workers: 32
        num_envs_per_worker: 5
        clip_rewards: True
        lr_schedule: [
            [0, 0.0005],
            [20000000, 0.000000000001],
        ]
        opt_type:
            grid_search:
                - adam
                - rmsprop
        model:
            conv_activation: elu
            framestack: false
            use_lstm: true
atari-ppo:
    env: BreakoutNoFrameskip-v4
    run: PPO
    config:
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        sample_batch_size: 100
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 10
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        vf_share_layers: true
        num_gpus: 1
        model:
            conv_activation: elu
            framestack: false
            use_lstm: true
alex-petrenko commented 5 years ago

Thank you for posting this, very interesting! I would not read too much into it, though, especially since the observations are so random and unpredictable, e.g. the RMSprop vs. Adam behavior in PPO vs. IMPALA. It might just be due to the random seed or other subtle effects.

ericl commented 5 years ago

I let it run for longer, and it looks like IMPALA + LSTM doesn't quite reach rewards as good as a feedforward policy. PPO behaves similarly.

[screenshot from 2019-08-08: longer training run, LSTM vs. feedforward reward curves]

alex-petrenko commented 5 years ago

I would expect it from PPO, because I've encountered this before with other implementations too. It might be because PPO takes many SGD steps on a batch, which can shift the distribution of hidden states too much. But for IMPALA there should be no such effect. The paper doesn't compare the performance of a feed-forward agent vs. an LSTM, so we may never know whether this is a method issue or an implementation issue.

One particular comment I have about the RNN implementation in RLlib concerns wall-time performance. RLlib takes a different approach from most other implementations I've seen: it zero-pads mini-batches if an episode terminates in the middle of a rollout. This happens in chop_into_sequences() in lstm.py. In my project I use fairly large input observations (128x72x3), and this function was a bottleneck for me, taking roughly as long as the backprop itself. As a result, the throughput of PPO or APPO with an RNN was at least 2-3x worse than in the feedforward case. There might be other inefficiencies too.
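
To illustrate the padding cost (this is only a conceptual sketch, not the actual chop_into_sequences() code): every feature column gets copied into a dense [num_seqs, max_seq_len, ...] buffer with zero padding, which is expensive for large image observations.

import numpy as np

def pad_rollout_chunks(chunks, max_seq_len):
    # Conceptual sketch: zero-pad variable-length episode chunks into a dense
    # [num_seqs, max_seq_len, ...] array, as sequence chopping has to do for
    # every feature column in the batch.
    obs_shape = chunks[0].shape[1:]
    padded = np.zeros((len(chunks), max_seq_len) + obs_shape, dtype=np.float32)
    seq_lens = np.zeros(len(chunks), dtype=np.int32)
    for i, chunk in enumerate(chunks):       # chunk: [t_i, *obs_shape]
        padded[i, :len(chunk)] = chunk       # copy + implicit zero padding
        seq_lens[i] = len(chunk)
    return padded, seq_lens

# e.g. a 128x72x3 rollout with an episode boundary after 20 steps:
chunks = [np.ones((20, 128, 72, 3)), np.ones((32, 128, 72, 3))]
padded, seq_lens = pad_rollout_chunks(chunks, max_seq_len=32)  # shape (2, 32, 128, 72, 3)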

My own PyTorch implementation just zeroes out the hidden state at episode boundaries, without any need to copy and rearrange the experience batch, and its throughput is only ~10% worse than the feed-forward version. Might be something to look into.
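
Roughly, the idea looks like this (a simplified sketch, not the actual code from my repo):

import torch
import torch.nn as nn

class MaskedLSTMCore(nn.Module):
    # Sketch: step an LSTM over a flat rollout and zero the carried state
    # wherever an episode ended, so no sequence chopping/padding is needed.
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def forward(self, obs_seq, done_seq, h, c):
        # obs_seq: [T, B, input_size]; done_seq: [T, B], 1.0 where the previous
        # step ended an episode; h, c: [B, hidden_size].
        outputs = []
        for obs_t, done_t in zip(obs_seq, done_seq):
            mask = (1.0 - done_t).unsqueeze(-1)   # reset state at boundaries
            h, c = self.cell(obs_t, (h * mask, c * mask))
            outputs.append(h)
        return torch.stack(outputs), (h, c)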

ericl commented 5 years ago

Yeah, there is the potential for speedup by moving the padding into TF with tf.scatter_nd: https://github.com/ray-project/ray/issues/2992

I had that implemented at some point, but it complicated PPO minibatching quite a bit. It probably does speed up IMPALA, though: https://github.com/ray-project/ray/compare/master...ericl:fix-lstm-seq?expand=1

josjo80 commented 5 years ago

I upgraded last week to the latest version of RLlib (0.7.3), and neither the feedforward nor the LSTM models do as well as before. I was running 0.7.0 beforehand. I know there was an upgrade to TensorFlow 2.0, so I upgraded my models to be compatible. Was there something else that changed in the config files? I've swept a significant number of the hyperparameters and I just can't replicate the earlier results. I'm working with the SMAC environment (version 0.1.0b1). The upper orange and red plots used an LSTM model on 0.7.0 and the bottom red plot used an LSTM on 0.7.3; the blue and green used a feedforward model on 0.7.0, while the magenta is a feedforward model on 0.7.3.

Overall, the new FF learns more slowly (on par with LSTM from before) and the new LSTM learns much more slowly.

[plot: V2_results, feedforward and LSTM reward curves on 0.7.0 vs. 0.7.3]

ericl commented 5 years ago

Is it possible to replicate this on a gym environment? As far as I know, we didn't make any behavior changes to PPO beyond the refactoring.

You can also try git bisecting to find the problematic commit.


josjo80 commented 5 years ago

I'm running BeamRiderNoFrameskip-v4 right now with PPO on RLlib 0.7.3. Should take about a day to get to 10M timesteps on my 4 RTX GPUs.

In the meantime, would you mind taking a look at my model and training script config? I followed your new designs in recurrent_tf_modelv2.py and fcnet_v2.py, as well as the parametric_action_cartpole.py example. I've implemented these models using parametric action selection but am not sure the value_function is getting passed correctly. There's also a curious issue when defining the input layer for the models: when I pass in input_layer = tf.keras.layers.Input(shape=(None, obs_space.shape[0])), the SMAC environment returns a different dimension than it does during training. This was not an issue in RLlib 0.7.0, since I didn't have to define the input_layer dimensions ahead of time when using tf.slim layers.

smac_ppo.zip

P.S. I'm not familiar with git bisecting. I'd be interested in learning more.

ericl commented 5 years ago

The value handling looks OK to me. Maybe it makes sense to use an entirely separate LSTM network for the value function, to avoid issues with the shared loss?
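
Something along these lines, for example (only a sketch of the branching idea; a real ModelV2 would also need the state inputs/outputs and sequence masking like in your current model):

import tensorflow as tf

def build_separate_vf_rnn(obs_dim, cell_size, num_outputs):
    # Two independent LSTM branches over the same observations, so the value
    # loss no longer back-propagates through the policy's recurrent weights.
    obs_in = tf.keras.layers.Input(shape=(None, obs_dim))
    pi_out = tf.keras.layers.LSTM(cell_size, return_sequences=True, name="pi_lstm")(obs_in)
    vf_out = tf.keras.layers.LSTM(cell_size, return_sequences=True, name="vf_lstm")(obs_in)
    logits = tf.keras.layers.Dense(num_outputs, name="logits")(pi_out)
    values = tf.keras.layers.Dense(1, name="values")(vf_out)
    return tf.keras.Model(obs_in, [logits, values])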

Btw, I think you don't need the None / batch dim in the input shape, though it shouldn't matter either way.

Eric


josjo80 commented 5 years ago

Ok. Thanks for the feedback. I was uncertain as to whether I was creating the models correctly in the new RLlib version.

BTW, below are the results after about 8 hours of running PPO on BeamRiderNoFrameskip-v4. I used the same config params as the ones you published. I'm starting to think there's an issue with PPO.

[plot: PPO_BeamRiderNoFrameskip-v4 reward curve]

Config:

run_experiments({
            "ppo_atari": {
                "run": "PPO",
                "env": "BeamRiderNoFrameskip-v4",
                "stop": {
                    "training_iteration": args.num_iters,
                },
                "config": {
                    "num_workers": args.num_workers,
                    "num_envs_per_worker": args.num_envs_per_worker,
                    "num_gpus": args.num_gpus,
                    "train_batch_size": 5000,
                    "sgd_minibatch_size": 500,  #Remove for APPO
                    "sample_batch_size": 100,    #Add for APPO, remove for PPO
                    "lr": 1e-4,
                    "lambda": .95,
                    "kl_coeff": 0.5,    #Remove for APPO
                    "clip_rewards": True,
                    "clip_param": 0.1,
                    "vf_clip_param": 10.0,
                    "entropy_coeff": 0.01,
                    "num_sgd_iter": 10,
                    "batch_mode": "truncate_episodes",
                    "observation_filter": "NoFilter",  # breaks the action mask
                    "vf_share_layers": True,  # don't create a separate value model (remove for APPO)
                    #"vf_loss_coeff": 1e-3,    #VF loss is error^2, so it can be really out of scale compared to the policy loss. 
                                              #Ref: https://github.com/ray-project/ray/issues/5278

                },
            },
         })
ericl commented 5 years ago

Hm, on master I am getting this after 2.5M timesteps (which took 30 min, btw, on a V100 -- not sure if that's supposed to be 50x faster than an RTX or whether there's some config issue on your side). RUNNING trials:

  • PPO_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=36434], 2052 s, 566 iter, 2830000 ts, 38.6 rew
  • PPO_BeamRiderNoFrameskip-v4_1_env=BeamRiderNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=36461], 2051 s, 570 iter, 2850000 ts, 902 rew
  • PPO_QbertNoFrameskip-v4_2_env=QbertNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=36398], 2051 s, 570 iter, 2850000 ts, 4.56e+03 rew
  • PPO_SpaceInvadersNoFrameskip-v4_3_env=SpaceInvadersNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=36411], 2054 s, 571 iter, 2855000 ts, 495 rew

So breakout and qbert look better than in rl-experiments, but beamrider a bit worse. I'll also try on 0.7.3.

Eric


ericl commented 5 years ago

master: [training plot]

0.7.3: [training plot]

So it seems about on par with previous results. Btw, this is with TF 1.13.

Eric


josjo80 commented 5 years ago

OK. Thanks for the feedback again. I'm not sure what's wrong with my setup, but there seems to be a clue in comparing training times: I let BeamRiderNoFrameskip-v4 run over the weekend and it took 13 hours to reach 1.5M timesteps, with an average reward of 619.

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 11/16 CPUs, 4/4 GPUs
Memory usage on this node: 61.7/134.8 GB
Result logdir: /home/johnson/ray_results/ppo_atari
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_BeamRiderNoFrameskip-v4_0:   RUNNING, [11 CPUs, 4 GPUs], [pid=23244], 47195 s, 299 iter, 1495000 ts, 619 rew

I checked online, and one benchmark puts the RTX at 80% of the performance of the V100, but with 4 RTXs I should have more than made up for that.

When I run RLlib and request all 4 GPUs (i.e. num_gpus=4), it reports that I am using them, but nvidia-smi shows all of the memory allocated on one GPU, and that GPU sits at only 1-3% utilization. I thought this might be due to PPO synchronizing gradient optimization over experiences, but even APPO typically shows low utilization.

Appreciate any feedback.

josjo80 commented 5 years ago

I noticed that your hardware setup is achieving ~1400 training ts / wall-clock sec, whereas I'm only achieving 25-30 ts / wall-clock sec with 11 CPUs. If I only use 1 CPU I get up to 60 ts / wall-clock sec.
Do you think this could affect PPO's learning progress in terms of average reward / ts? Do you think there is an issue with CPU bus bandwidth? Changing num_gpus does not appear to affect this number. Have you seen this problem before?

ericl commented 5 years ago

Maybe try 1 GPU? I rarely use more than one, since it typically provides diminishing returns for Atari-sized models and batches.


josjo80 commented 5 years ago

I tried using 1 GPU and it remains slow at about 25 ts / wall-clock sec.

I'm curious about the various time metrics reported per iteration. For instance, grad_time_ms below reads 185497.539 ms while time_this_iter_s is 189.34 s, and sometimes I've even seen grad_time_ms larger than time_this_iter_s, which I don't understand how is possible. When I tally up the other time metrics, they don't seem to add up to time_this_iter_s either. Is there any documentation on how to interpret these metrics? Or could you publish your iteration metrics? It would help me understand where my bottleneck is.

Thanks!

Result for PPO_BeamRiderNoFrameskip-v4_0:
  custom_metrics: {}
  date: 2019-08-20_14-15-56
  done: false
  episode_len_mean: 4181.6
  episode_reward_max: 308.0
  episode_reward_mean: 149.6
  episode_reward_min: 44.0
  episodes_this_iter: 4
  episodes_total: 5
  experiment_id: 532ad40208a74db18db6364e6e734082
  hostname: cassini
  info:
    grad_time_ms: 185497.539
    learner:
      default_policy:
        cur_kl_coeff: 0.375
        cur_lr: 9.999999747378752e-05
        entropy: 0.9963899850845337
        entropy_coeff: 0.009999999776482582
        kl: 0.013735565356910229
        policy_loss: -0.029520530253648758
        total_loss: -0.008013482205569744
        vf_explained_var: 0.44205254316329956
        vf_loss: 0.02632010541856289
    load_time_ms: 1049.737
    num_steps_sampled: 60000
    num_steps_trained: 60000
    sample_time_ms: 2836.502
    update_time_ms: 21.598
  iterations_since_restore: 12
  node_ip: 10.1.10.142
  num_healthy_workers: 10
  off_policy_estimator: {}
  pid: 7660
  policy_reward_mean: {}
  sampler_perf:
    mean_env_wait_ms: 15.76998344180198
    mean_inference_ms: 4.716223501473037
    mean_processing_ms: 2.9654766523015694
  time_since_restore: 2279.9584567546844
  time_this_iter_s: 189.34279656410217
  time_total_s: 2279.9584567546844
  timestamp: 1566310556
  timesteps_since_restore: 60000
  timesteps_this_iter: 5000
  timesteps_total: 60000
  training_iteration: 12
  trial_id: b0c6b7e6

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 11/16 CPUs, 1/4 GPUs
Memory usage on this node: 22.1/134.8 GB
Result logdir: /home/johnson/ray_results/ppo_atari
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_BeamRiderNoFrameskip-v4_0:   RUNNING, [11 CPUs, 1 GPUs], [pid=7660], 2279 s, 12 iter, 60000 ts, 150 rew
josjo80 commented 5 years ago

Edit: I uninstalled tensorflow 1.14 and installed tensorflow-gpu 1.14. I can now see memory allocated on all of the GPUs and the training process running in nvidia-smi. However, the speed is still quite slow with num_gpus=1, and I only get about a 3x speedup with num_gpus=4.

For reference, grad_time_ms is still ~185000 ms for 1 GPU. When I request no GPUs I get the same speed.
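
In case it helps anyone else, here is a quick way to sanity-check whether the installed TF build can see the GPUs at all (assuming TF 1.x, as here):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists one /device:GPU:N entry per visible GPU; a CPU-only wheel (plain
# `tensorflow` instead of `tensorflow-gpu` on 1.x) shows no GPU devices.
print(device_lib.list_local_devices())
print("GPU available:", tf.test.is_gpu_available())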

ericl commented 5 years ago

Could be. I'm guessing this isn't anything RLlib-related but a TF/CUDA issue, since we don't really do anything special. 1.14 should work fine too.

On Wed, Aug 21, 2019, 1:06 AM Joshua Johnson notifications@github.com wrote:

> BTW, I'm using TF 1.14. Do I need to use tensorflow-gpu, or 1.13 instead?


josjo80 commented 5 years ago

Yeah, I came to the same conclusion. I'm drilling down into CUDA and cuDNN now.

RedTachyon commented 5 years ago

Any news on this? I'm noticing similar problems and am wondering whether it'd be worth writing a PPO from scratch in case something is wrong with this one.

josjo80 commented 5 years ago

My issue turned out to be enabling cuDNN at the driver level. I've been using the PPO algorithm on StarCraft and other environments, and it seems to work just fine. Some tips:

- Pay attention to what @ericl mentioned regarding vf_loss_coeff. The vf_loss can be much larger than the policy loss (depending on the game), so use vf_loss_coeff to scale it down in line with the policy loss.
- Pay attention to the LSTM architecture. There are some subtleties in the architecture that may impact performance. I based mine on https://github.com/ray-project/ray/blob/master/rllib/examples/custom_keras_rnn_model.py.
- From what I can tell, LSTMs train more slowly but reach a higher ultimate average reward, while FF networks train faster at the beginning but max out earlier. I'm still not sure exactly why, but if you follow OpenAI's work on both the Dota 2 Five network architecture and their robotic manipulation paper, you'll see they highly recommend using an LSTM.
- Lastly, be patient with the training run. Any environment complex enough to need an LSTM and scalable PPO is going to take a while to train and show results. I literally spent weeks iterating through the PPO hyperparameters and network architecture; some of that time went into upgrading to the latest Keras API, but much of it went into finding hyperparameters that gave replicable results. Try to change one parameter at a time if possible, and remember there's a lot of variance in results regardless of your changes, so try to pick the signal out from the noise.

toanngosy commented 4 years ago

Is it true that the vf_loss in the reporter is the vf_loss after scaling by vf_loss_coeff?

josjo80 commented 4 years ago

I don't believe so. From my observations, it appears to be the loss before the coefficient is applied.
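
Roughly, my understanding of how the terms combine (a sketch, not the exact RLlib code) is below; the reported vf_loss is the raw value-error term, and vf_loss_coeff only enters when everything is summed:

def ppo_total_loss(policy_loss, vf_loss, kl, entropy,
                   vf_loss_coeff=1.0, kl_coeff=0.2, entropy_coeff=0.0):
    # Rough shape of the combined PPO objective: the reported vf_loss is the
    # unscaled value-function error; vf_loss_coeff is applied only here.
    return (policy_loss
            + kl_coeff * kl
            + vf_loss_coeff * vf_loss
            - entropy_coeff * entropy)

# Plugging in the learner stats from my earlier result block (default
# vf_loss_coeff=1.0) reproduces the reported total_loss of about -0.008:
print(ppo_total_loss(-0.029521, 0.026320, 0.013736, 0.996390,
                     vf_loss_coeff=1.0, kl_coeff=0.375, entropy_coeff=0.01))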

eugenevinitsky commented 4 years ago

@josjo80, could you maybe post the architecture and hyperparams that did work for you in the end? It'd be great as a starting point to iterate on. Obviously any result is reward specific, but it'd still be really helpful.

josjo80 commented 4 years ago

Below are results for the SMAC environment.

SMAC Results

Avg reward @ 45M steps = 21.6
Battles won @ 45M steps = 93%
Model = LSTM
limit = 10k (I had to adjust the max episode length in the SMAC map settings to 10k steps; this requires a hard-coded change to the MMM entry in https://github.com/oxwhirl/smac/blob/master/smac/env/starcraft2/maps/smac_maps.py)

LSTM architecture (imported into PPO script as rnn4d.py)

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

from ray.rllib.models import ModelCatalog
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.recurrent_tf_modelv2 import RecurrentTFModelV2
from ray.rllib.models.tf.misc import normc_initializer
from ray.rllib.policy.rnn_sequencing import add_time_dimension
from ray.rllib.utils.annotations import override, DeveloperAPI
from ray.rllib.utils import try_import_tf

tf = try_import_tf()

@DeveloperAPI
class MaskedActionsLSTM(RecurrentTFModelV2):
    """Custom RLlib model that emits -inf logits for invalid actions.

    This is used to handle the variable-length StarCraft action space.
    """
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name, **kw):
        super(MaskedActionsLSTM, self).__init__(obs_space, action_space, num_outputs, model_config, name, **kw)

        self.cell_size = model_config.get("lstm_cell_size")
        self.fcnet_hiddens = model_config.get("fcnet_hiddens")

        # Define input layers
        input_layer = tf.keras.layers.Input(
            shape=(None, 160))
        state_in_h = tf.keras.layers.Input(shape=(self.cell_size, ))
        state_in_c = tf.keras.layers.Input(shape=(self.cell_size, ))
        seq_in = tf.keras.layers.Input(shape=())

        # Preprocess observation with a hidden layer and send to LSTM cell
        dense1 = tf.keras.layers.Dense(
            self.fcnet_hiddens, activation=tf.nn.relu, name="dense1")(input_layer)
        lstm_out, state_h, state_c = tf.keras.layers.LSTM(
            self.cell_size, return_sequences=True, return_state=True, name="lstm")(
                inputs=dense1,
                mask=tf.sequence_mask(seq_in),
                initial_state=[state_in_h, state_in_c])

        # Postprocess LSTM output with another hidden layer and compute values
        logits = tf.keras.layers.Dense(
            self.num_outputs,
            activation=tf.keras.activations.linear,
            name="logits")(lstm_out)
        values = tf.keras.layers.Dense(
            1, activation=None, name="values")(lstm_out)

        # Create the RNN model
        self.rnn_model = tf.keras.Model(
            inputs=[input_layer, seq_in, state_in_h, state_in_c],
            outputs=[logits, values, state_h, state_c])
        self.register_variables(self.rnn_model.variables)
        self.rnn_model.summary()

    @override(RecurrentTFModelV2)
    def forward(self, input_dict, state, seq_lens):
        """Adds time dimension to batch before sending inputs to forward_rnn().
        You should implement forward_rnn() in your subclass."""
        action_mask = input_dict["obs"]["action_mask"]
        action_logits, new_state = self.forward_rnn(
            add_time_dimension(input_dict["obs"]["obs"], seq_lens), state,
            seq_lens)

        action_logits = tf.reshape(action_logits, [-1, self.num_outputs])

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)
        masked_logits = inf_mask + action_logits

        return masked_logits, new_state

    @override(RecurrentTFModelV2)
    def forward_rnn(self, inputs, state, seq_lens):
        model_out, self._value_out, h, c = self.rnn_model([inputs, seq_lens] +
                                                          state)
        return model_out, [h, c]

    @override(ModelV2)
    def get_initial_state(self):
        return [
            np.zeros(self.cell_size, np.float32),
            np.zeros(self.cell_size, np.float32),
        ]

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])

Below is the PPO training script with hyperparameters:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import logging
LOG_FILENAME = 'logging2.out'
logging.basicConfig(filename=LOG_FILENAME, level=logging.DEBUG)

logging.debug('This message should go to the log file')

"""Example of running StarCraft2 with RLlib PPO.

In this setup, each agent will be controlled by an independent PPO policy.
However the policies share weights.

Increase the level of parallelism by changing --num-workers.
"""
import argparse
import numpy as np

import ray
from ray import tune
from ray.tune import run_experiments, register_env
from ray.rllib.models import ModelCatalog

from smac.examples.rllib.env import RLlibStarCraft2Env
from smac.examples.rllib.rnn4d import MaskedActionsLSTM

def on_episode_start(info):
    episode = info["episode"]
    episode.user_data["step_wins"] = []

def on_episode_step(info):
    episode = info["episode"]
    try:
        outcome = float(episode.last_info_for(0)["battle_won"])
    except:
        outcome = 0.
    episode.user_data["step_wins"].append(outcome)

def on_episode_end(info):
    episode = info["episode"]
    episode_wins = np.sum(episode.user_data["step_wins"])
    episode.custom_metrics["episode_wins"] = episode_wins

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-iters", type=int, default=600)
    parser.add_argument("--num-workers", type=int, default=15)
    parser.add_argument("--num-envs-per-worker", type=int, default=11)
    parser.add_argument("--num-gpus", type=int, default=4)
    parser.add_argument("--map-name", type=str, default="MMM")
    args = parser.parse_args()

    ray.init()

    register_env("smac", lambda smac_args: RLlibStarCraft2Env(**smac_args))
    ModelCatalog.register_custom_model("mask_model", MaskedActionsLSTM)

    try:
        run_experiments({
            "ppo_sc2": {
                "run": "PPO",
                "env": "smac",
                "stop": {
                    "training_iteration": args.num_iters,
                },
                "checkpoint_freq": 10,
                "config": {
                    "num_workers": args.num_workers,
                    "num_envs_per_worker": args.num_envs_per_worker,
                    "num_gpus": args.num_gpus,
                    "ignore_worker_failures": True,
                    "train_batch_size": 50000,
                    "sgd_minibatch_size": 5000,  #Remove for APPO
                    #"sample_batch_size": 30,    #Add for APPO, remove for PPO
                    "lr": 1e-4,
                    "lambda": .995,
                    "kl_coeff": 1.0,    #Remove for APPO
                    "clip_param": 0.2,
                    "num_sgd_iter": 10,
                    "observation_filter": "NoFilter",  # breaks the action mask
                    "vf_share_layers": True,  # don't create a separate value model (remove for APPO)
                    #"vf_loss_coeff": 1e-3,    #VF loss is error^2, so it can be really out of scale compared to the policy loss. 
                                              #Ref: https://github.com/ray-project/ray/issues/5278
                    "env_config": {
                        "map_name": args.map_name,
                    },
                    "model": {
                        "custom_model": "mask_model",
                        "fcnet_hiddens": 64,
                        "lstm_cell_size": 64,
                        "max_seq_len": 100,
                    },
                    "callbacks": {
                    "on_episode_start": tune.function(on_episode_start),
                    "on_episode_step": tune.function(on_episode_step),
                    "on_episode_end": tune.function(on_episode_end)
                    },
                },
            },
         })
    except:
        logging.exception('Got exception on main handler')
        raise
eugenevinitsky commented 4 years ago

Thank you so much!

josjo80 commented 4 years ago

No problem. Do me a favor: if you find better hyperparameters or architectures, reply back here so I can learn from you as well!

ericl commented 4 years ago

Auto-closing stale issue.