Open 1900360 opened 1 year ago
Hi @1900360!
Did the parameters mentioned in this issue help with that task?
Hi @nslyubaykin Sure, here are the parameters:
actor = PPO(
    policy_net=NormalLSTM(obs_dim, acs_dim,
                          nlayers_lstm=2,
                          seq_len=1+n_lags,
                          nunits_lstm=32,
                          nunits_dense=8,
                          out_activation=torch.nn.Identity(),
                          init_log_std=-1.5),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    n_epochs_per_update=50,
    batch_size=5000,
    target_kl=0.2,
    eps=0.2,
    gamma=0.9,
    obs_nlags=n_lags,
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros',
    standardize_advantages=True,
    weight_decay=0.0
)

critic = GAE(
    critic_net=VLSTM(obs_dim,
                     nlayers_lstm=2,
                     seq_len=1+n_lags,
                     nunits_lstm=32,
                     nunits_dense=8),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    batch_size=5000,
    gamma=0.9,
    gae_lambda=0.95,
    n_target_updates=1,
    n_steps_per_update=50,
    obs_nlags=n_lags,
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros'
)

parallel_sampler = ParallelSampler(env=envs,
                                   obs_nlags=n_lags,
                                   obs_expand_axis=0,
                                   obs_concat_axis=0,
                                   gpus_share=0,
                                   obs_padding='zeros')

for step in tqdm(range(n_steps)):
    actor.set_critic(None)
    actor.set_device(torch.device('cpu'))
    train_batch = parallel_sampler.sample(actor=actor,
                                          n_transitions=5000,
                                          max_path_length=None,
                                          reset_when_not_done=False,
                                          train_sampling=True)
    actor.set_device(torch.device('cpu'))
    actor.set_critic(critic)
    critic_logs = critic.update(train_batch)
    actor_logs = actor.update(train_batch)
Training is too slow for such a simple gym environment. Is there any room for improvement?
Waiting for your answer, I'm very interested in this :D
Hi @1900360!
I am not sure I understand correctly what you mean by slow training. Is the convergence itself slower, or is it computationally slower? And the second question: what do you mean by improved space?
Hi @nslyubaykin! Sorry for being unclear. By 'slow training' I mean not only slow convergence but also computation time, which is mainly spent on these steps:
critic_logs = critic.update(train_batch)
actor_logs = actor.update(train_batch)
I have used the parameters from parallel_ppo, but I still get this reward curve, so I wonder whether my parameter settings are correct:
Do you have any ideas? I can't figure it out since I'm a beginner in DRL :)
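To check which of the two update calls dominates the wall-clock time, each one can be timed separately. The sketch below is only an illustration: `timed` and `dummy_update` are hypothetical helpers, not part of the library, and the dummy stands in for `critic.update` / `actor.update`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for an expensive update call. In the training loop
# this would be used as, e.g., timed(critic.update, train_batch).
def dummy_update(batch):
    return {"n": len(batch)}


logs, seconds = timed(dummy_update, [1, 2, 3])
print(f"update took {seconds:.6f} s, logs={logs}")
```

Printing the two timings once per training step should show whether the critic update or the actor update is the bottleneck.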
Hi @1900360!
The reason for the slower computation is that your policy now deals with larger observations (obs_dim*n_lags) and, other things being equal, has more parameters (the new architecture may also affect this). There is also some minor computational overhead from creating and processing lags.
Regarding the training divergence, one possibility is that you simply need to find a new set of hyperparameters for this architecture, which can only be done by trial and error. The other possibility is that performance in this environment is harmed by introducing lags. In my experience, when the environment is already fully observable, lags may hurt performance by adding redundant information to each observation.
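To make the effect of lags on observation size concrete, here is a minimal sketch of lag stacking with zero padding. It is not the library's implementation: `stack_lags` is a hypothetical helper, and the oldest-first ordering inside each window is an assumption.

```python
def stack_lags(observations, n_lags):
    """For each step t, return the window [o_(t-n_lags), ..., o_t].

    Steps before the start of the episode are filled with zero vectors,
    mirroring the idea behind obs_padding='zeros'.
    """
    obs_dim = len(observations[0])
    stacked = []
    for t in range(len(observations)):
        window = []
        for k in range(t - n_lags, t + 1):
            if k >= 0:
                window.append(list(observations[k]))
            else:
                window.append([0.0] * obs_dim)  # zero padding
        stacked.append(window)
    return stacked


# Three steps of a 3-dimensional observation (Pendulum has obs_dim = 3).
episode = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.2], [0.8, 0.2, 0.3]]
stacked = stack_lags(episode, n_lags=2)

# Each network input now holds (1 + n_lags) * obs_dim numbers instead of obs_dim.
print(len(stacked[0]), "x", len(stacked[0][0]))
```

With `n_lags=2`, every input is a sequence of 3 observations rather than one, so each forward and backward pass processes three times as many values.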
Also, using out_activation=torch.nn.Tanh() and acs_scale=2 with NormalLSTM is preferable for that task.
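The intuition is that Pendulum-v0 accepts torques in [-2, 2], so a tanh output multiplied by 2 keeps the action mean inside the valid range. A minimal sketch, assuming acs_scale simply multiplies the squashed network output (how the library applies it internally is not confirmed here; `squashed_mean` is a hypothetical name):

```python
import math

ACS_SCALE = 2.0  # Pendulum-v0 torque bound


def squashed_mean(raw_output, acs_scale=ACS_SCALE):
    """Map an unbounded network output to [-acs_scale, acs_scale]."""
    return acs_scale * math.tanh(raw_output)


for raw in (-10.0, 0.0, 0.5, 10.0):
    print(raw, squashed_mean(raw))
```

With an Identity output activation the mean is unbounded, so many sampled actions can fall outside the torque limits, where changes in the action no longer change the environment's response.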
Hi @nslyubaykin, LSTM+PPO does not converge in the Pendulum-v0 environment. I don't know whether there is a configuration error in my code; could you take a look? The reward curve is shown below: lstm_parallel_ppo.txt