haosulab / ManiSkill

SAPIEN Manipulation Skill Framework, a GPU parallelized robotics simulator and benchmark
https://maniskill.ai/
Apache License 2.0

Improve PPO baselines when there are no partial resets #496

Closed by StoneT2000 4 days ago

StoneT2000 commented 4 weeks ago

The PPO baselines perform pretty poorly when there are no partial resets, and it's unclear exactly why this occurs. It could be a small bug in the PPO implementation. Initial experimentation suggests the reward scale has much more impact on performance than usual, and the value loss explodes when tasks begin to succeed.
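
For reference, the reward scale just multiplies the raw environment reward before returns and advantages are computed (assuming a CleanRL-style update like these baselines use), so the value targets grow linearly with the scale and the squared-error value loss roughly quadratically. A rough, illustrative sketch of that part of the update, not the exact baseline code:

    import torch

    def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
        # rewards: (T, N) rewards already multiplied by reward_scale
        # values:  (T, N) critic predictions at each step
        # dones:   (T, N) 1.0 where the episode ended after step t
        # next_value: (N,) critic prediction for the state after the last step
        T = rewards.shape[0]
        advantages = torch.zeros_like(rewards)
        lastgaelam = torch.zeros_like(next_value)
        for t in reversed(range(T)):
            nextvalue = next_value if t == T - 1 else values[t + 1]
            nextnonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * nextvalue * nextnonterminal - values[t]
            lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
            advantages[t] = lastgaelam
        # returns are the value targets; they scale linearly with reward_scale,
        # so the value loss 0.5 * ((new_value - returns) ** 2).mean() scales
        # roughly quadratically with it until the critic catches up
        returns = advantages + values
        return advantages, returns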

sean1295 commented 4 weeks ago

A possible reason could be using logstd without clipping the std. You can try it with

    self.actor_std = nn.Parameter(torch.ones(1, np.prod(env.single_action_space.shape)) * 0.1)

and do

    action_std = torch.clamp(self.actor_std, min=1e-5, max=0.1).expand_as(action_mean)  
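
Put together, it would look roughly like this in a CleanRL-style agent (a sketch under that assumption, not the exact ManiSkill baseline code):

    import numpy as np
    import torch
    import torch.nn as nn
    from torch.distributions import Normal

    class Agent(nn.Module):
        def __init__(self, env):
            super().__init__()
            act_dim = int(np.prod(env.single_action_space.shape))
            obs_dim = int(np.prod(env.single_observation_space.shape))
            self.actor_mean = nn.Sequential(
                nn.Linear(obs_dim, 256), nn.Tanh(), nn.Linear(256, act_dim)
            )
            # learn the std directly (instead of a logstd) and clamp it in get_action
            self.actor_std = nn.Parameter(torch.ones(1, act_dim) * 0.1)

        def get_action(self, obs):
            action_mean = self.actor_mean(obs)
            # keep the std strictly positive and small so exploration noise can
            # neither collapse to zero nor blow up
            action_std = torch.clamp(self.actor_std, min=1e-5, max=0.1).expand_as(action_mean)
            dist = Normal(action_mean, action_std)
            action = dist.sample()
            return action, dist.log_prob(action).sum(-1)
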
StoneT2000 commented 4 weeks ago

You can try testing it with

    seed=1
    python ppo_rgb.py --env_id="PushT-v1" --seed=${seed} \
        --num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
        --total_timesteps=50_000_000 --num-steps=100 --num_eval_steps=100 --gamma=0.99 \
        --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
        --exp-name="ppo-PushT-v1-rgb-${seed}-walltime_efficient"

and the partial reset version

    python ppo.py --env_id="PushT-v1" \
        --num_envs=1024 --update_epochs=8 --num_minibatches=32 \
        --total_timesteps=25_000_000 --num-steps=100 --num_eval_steps=100 --gamma=0.99

I think the no-partial-reset version runs much slower than expected. I have to set reward_scale=0.1 to avoid the value loss exploding. I can also give your trick a try later.

StoneT2000 commented 4 weeks ago

[image: training curves] Unfortunately the clamping did not work. The black line is with reward scaling of 0.1; the other line is with clamping and no reward scaling. Normally I do not do any reward scaling with the PPO example code here: https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/ppo/examples.sh