nslyubaykin / rnns_for_pomdp

Recurrent Policies for Handling Partially Observable Environments

lstm+ppo cannot converge in Pendulum-v0 environment #2

Open 1900360 opened 1 year ago

1900360 commented 1 year ago

Hi @nslyubaykin! LSTM+PPO cannot converge in the Pendulum-v0 environment, and I don't know whether there is a setting error in my code. Could you take a look? The reward curve is shown below: [reward curve image] lstm_parallel_ppo.txt

nslyubaykin commented 1 year ago

Hi @1900360!

Did the parameters mentioned in this issue help with that task?

1900360 commented 1 year ago

Hi @nslyubaykin! Sure, here are the parameters:

actor = PPO(
    policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2, seq_len=1+n_lags,
                          nunits_lstm=32, nunits_dense=8,
                          out_activation=torch.nn.Identity(), init_log_std=-1.5),
    device=torch.device('cpu'),
    learning_rate=1e-4, n_epochs_per_update=50, batch_size=5000,
    target_kl=0.2, eps=0.2, gamma=0.9,
    obs_nlags=n_lags, obs_expand_axis=0, obs_concat_axis=0, obs_padding='zeros',
    standardize_advantages=True, weight_decay=0.0
)

critic = GAE(
    critic_net=VLSTM(obs_dim, nlayers_lstm=2, seq_len=1+n_lags,
                     nunits_lstm=32, nunits_dense=8),
    device=torch.device('cpu'),
    learning_rate=1e-4, batch_size=5000,
    gamma=0.9, gae_lambda=0.95,
    n_target_updates=1, n_steps_per_update=50,
    obs_nlags=n_lags, obs_expand_axis=0, obs_concat_axis=0, obs_padding='zeros'
)

parallel_sampler = ParallelSampler(
    env=envs,
    obs_nlags=n_lags, obs_expand_axis=0, obs_concat_axis=0,
    gpus_share=0, obs_padding='zeros'
)

for step in tqdm(range(n_steps)):
    actor.set_critic(None)
    actor.set_device(torch.device('cpu'))
    train_batch = parallel_sampler.sample(actor=actor,
                                          n_transitions=5000,
                                          max_path_length=None,
                                          reset_when_not_done=False,
                                          train_sampling=True)
    actor.set_device(torch.device('cpu'))
    actor.set_critic(critic)
    critic_logs = critic.update(train_batch)
    actor_logs = actor.update(train_batch)

The training process is too slow for such a simple gym environment; is there some room for improvement?

1900360 commented 1 year ago

Waiting for your answer, I'm very interested in this :D

nslyubaykin commented 1 year ago

Hi @1900360!

I am not sure I understand correctly what you mean by slow training. Is the convergence itself slower, or is it computationally slower? And the second question: what do you mean by room for improvement?

1900360 commented 1 year ago

Hi @nslyubaykin! Sorry for my unclear statement. By 'slow training' I mean not only slow convergence, but also that the computing resources are mainly spent on these steps:

critic_logs = critic.update(train_batch)
actor_logs = actor.update(train_batch)

I have used the parameters from parallel_ppo, but I still get the reward curve below, so I wonder whether my parameter settings are correct: [reward curve image]
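One way to confirm where the time actually goes is to time each step of the loop. A generic sketch in plain Python (not part of the repo's API):

import time

# A timing helper to see which step of the loop dominates:
# sampling, critic.update, or actor.update.
def timed(label, fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Inside the training loop, for example:
# train_batch = timed("sampling", parallel_sampler.sample, actor=actor,
#                     n_transitions=5000, max_path_length=None,
#                     reset_when_not_done=False, train_sampling=True)
# critic_logs = timed("critic.update", critic.update, train_batch)
# actor_logs  = timed("actor.update", actor.update, train_batch)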

1900360 commented 1 year ago

Do you have any ideas? I haven't figured anything out myself since I'm new to DRL :)

nslyubaykin commented 1 year ago

Hi @1900360!

The reason for the slower computation is that your policy is now dealing with larger observations (obs_dim * n_lags) and, other things being equal, has more parameters (the new architecture may also affect this). Plus, there is some minor computational overhead from creating and processing lags. Regarding training divergence, one possibility is that you simply need to find the right set of hyper-parameters for this new architecture (which can only be found by trial and error). The other possibility is that performance in this environment is simply harmed by introducing lags. In my experience, when the environment is already fully observable, using lags may hurt performance by adding redundant information to the observation.
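For intuition, here is a minimal NumPy sketch (an illustration only, not the repo's actual lag-processing code) of how stacking lagged observations with 'zeros' padding inflates the vector the networks consume:

import numpy as np

# Pendulum-v0 observations have obs_dim = 3; with n_lags past frames
# concatenated along axis 0, the effective input grows to obs_dim * (1 + n_lags).
obs_dim, n_lags = 3, 4
history = [np.random.randn(obs_dim) for _ in range(2)]  # only 2 frames seen so far

# Pad the missing lags with zeros ('zeros' padding), newest frame last.
padded = [np.zeros(obs_dim)] * (1 + n_lags - len(history)) + history
stacked = np.concatenate(padded, axis=0)
print(stacked.shape)  # (15,) == obs_dim * (1 + n_lags)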

Also, using

out_activation=torch.nn.Tanh()
acs_scale=2

with NormalLSTM is preferable for that task.
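A minimal sketch of why this helps for Pendulum-v0, whose actions are bounded to [-2, 2] (assuming, as the name suggests, that acs_scale simply scales the squashed output):

import torch

# Pendulum-v0 torque is bounded to [-2, 2].
raw_mean = torch.tensor([3.7, -5.2, 0.4])        # unbounded network output

# With out_activation=torch.nn.Identity() the mean can leave the valid range:
identity_mean = torch.nn.Identity()(raw_mean)    # tensor([ 3.7, -5.2,  0.4])

# With out_activation=torch.nn.Tanh() and a scale of 2 it stays inside [-2, 2]:
scaled_mean = 2.0 * torch.nn.Tanh()(raw_mean)    # roughly [ 2.0, -2.0,  0.76]

print(identity_mean, scaled_mean)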