pengzhi1998 closed this issue 2 years ago
Hi, I'm not an expert, but I implemented PPO from scratch (so it's messy) in another project (see my humble repo). My conclusion was that PPO won't work alone; it needs all the standard tricks.
Thank you, @AlpoGIT! I'll take a look :)
You can check the April Update, which is a bit more stable. A better advantage estimate such as GAE (Generalized Advantage Estimation) should also stabilize the training.
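For reference, GAE replaces the plain discounted-return advantage with an exponentially weighted average of n-step TD errors, which usually reduces gradient variance. A minimal sketch (not the repo's actual code; the function name, array shapes, and the `gamma`/`lam` defaults here are assumptions for illustration):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Hypothetical GAE helper: compute advantages and returns.

    rewards, dones: length-T arrays from a rollout.
    values: length T+1 array of value estimates (last entry is the
            bootstrap value for the state after the rollout).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Sweep backwards, accumulating discounted TD errors.
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value loss
    return advantages, returns
```

With `lam=1` this reduces to Monte Carlo returns minus the baseline; with `lam=0` it is the one-step TD error. Normalizing the resulting advantages per batch is another of the standard tricks mentioned above.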
Hi, I'm using your great implementation of PPO (discrete) in another project on robot obstacle avoidance.
Originally, I used DDDQN to train the robot's motion, and the training was successful. Then I used the same network and reward function from the DDDQN implementation for this PPO implementation, and I tried several different sets of hyper-parameters (changing the values of lr, update_timestep, k_epochs, etc.). Nevertheless, none of them worked.
The robot seems to have learned nothing even after 10 hours of training, and the reward remains very low. Do you know what the problem could be? Is it likely a matter of hyper-parameters, network structure, or just some kind of logical error? Could Python 2 be a problem? (Python 2 does work for this implementation on CartPole.)
I really look forward to your reply, and thank you again for your PPO implementation!