shikharbahl / neural-dynamic-policies


How can I verify your algorithm performance? #5

Closed OhJeongwoo closed 3 years ago

OhJeongwoo commented 3 years ago

Hello, I have some questions while running your NDP algorithm.

I finished setting up the environment and ran the code for about 8 hours with $ sh run_rl.sh faucet dmp 2 1

However, it does not seem to be working well. [screenshot: 2021-05-24 04-03-02]

My questions are:

  1. How long does it take to train the PPO and NDP algorithms on the example environment?
  2. How can I view or visualize the resulting plot?
  3. I have a question about dmp_train.py, lines 45 to 72:
for j in range(num_updates):
    if args.use_linear_lr_decay:
        utils.update_linear_schedule(
            agent.optimizer, j, num_updates,
            agent.optimizer.lr if args.algo == "acktr" else args.lr)
    envs.reset()
    for step in range(args.num_steps):
        if step % args.T == 0:
            with torch.no_grad():
                values, actions, action_log_probs_list, recurrent_hidden_states_lst = actor_critic.act(
                    rollouts.obs[step], rollouts.recurrent_hidden_states[step],
                    rollouts.masks[step])

            action = actions[step % args.T]
            action_log_probs = action_log_probs_list[step % args.T]
            recurrent_hidden_states = recurrent_hidden_states_lst[0]
            value = values[:, step % args.T].view(-1, 1)

        obs, reward, done, infos = envs.step(action)

        episode_rewards.append(reward[0].item())
        masks = torch.FloatTensor(
            [[0.0] if done_ else [1.0] for done_ in done])
        bad_masks = torch.FloatTensor(
            [[0.0] if 'bad_transition' in info.keys() else [1.0]
             for info in infos])
        rollouts.insert(obs, recurrent_hidden_states, action,
                        action_log_probs, value, reward, masks, bad_masks)

In this code, you feed the same action to the environment for steps Ts to T(s+1)-1 (T = args.T). However, I think that at each step the action should be set to actions[step % args.N], since the DMP actor outputs N (N = args.N) actions per rollout. Could you explain this part in more detail?

Thank you!

shikharbahl commented 3 years ago

Hi @OhJeongwoo,

To answer your questions:

1) How long does it take to train the PPO and NDP algorithms on the example environment? It should take around 3M steps to see good results, which is roughly 4-6 hours (though this can vary drastically depending on your machine).

2) How can I view or visualize the resulting plot? The logs are saved in epoch_data.npy; you can use any visualization package you like. I will update the repo with some visualization code shortly.
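In the meantime, a minimal sketch of what that could look like (this assumes epoch_data.npy contains a 1-D array of per-epoch episode rewards; adjust to the actual layout of the file):

import numpy as np
import matplotlib.pyplot as plt

# Load the logged training data; assumed here to be a 1-D array of
# per-epoch episode rewards (adjust if the file stores something else).
data = np.load("epoch_data.npy", allow_pickle=True)
rewards = np.asarray(data, dtype=float).ravel()

plt.plot(rewards)
plt.xlabel("epoch")
plt.ylabel("episode reward")
plt.title("NDP training curve")
plt.show()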

3) About dmp_train.py, lines 45 to 72: args.T is actually the length of the NDP rollout. Every T steps, the NDP outputs DMP parameters and executes the resulting trajectory for T steps, so we should take the k-th action for k = 0, ..., T-1. More details can be found in https://arxiv.org/pdf/2012.02788.pdf
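Schematically, the intended behavior is something like this (a simplified sketch of the rollout structure, not the exact dmp_train.py code; actor_critic and env stand in for the real policy and environment objects):

def ndp_rollout(actor_critic, env, num_steps, T):
    # Illustrative only: every T environment steps the policy produces a
    # length-T trajectory of low-level actions (by unrolling the predicted
    # DMP parameters), and the k-th action is executed at sub-step k.
    obs = env.reset()
    for step in range(num_steps):
        k = step % T
        if k == 0:
            actions = actor_critic.act(obs)  # length-T action trajectory
        obs, reward, done, info = env.step(actions[k])
        if done:
            obs = env.reset()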

ottofabian commented 2 years ago

I have to disagree with your answer regarding 3). While it is true that you generate a new set of DMP weights every T timesteps, I agree with @OhJeongwoo that the per-step action execution is incorrect: from the resulting trajectory of length T (here, actions), only the first action is ever used, and it is repeated for all T steps, because action is never updated before the next set of DMP weights is sampled.
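A minimal change along these lines would fix the indexing (a sketch against the snippet quoted above, using its variable names; I have not tested it against the rest of the training loop):

for step in range(args.num_steps):
    if step % args.T == 0:
        with torch.no_grad():
            values, actions, action_log_probs_list, recurrent_hidden_states_lst = actor_critic.act(
                rollouts.obs[step], rollouts.recurrent_hidden_states[step],
                rollouts.masks[step])
        recurrent_hidden_states = recurrent_hidden_states_lst[0]

    # Index into the current T-step trajectory at every environment step,
    # not only when new DMP weights are sampled.
    k = step % args.T
    action = actions[k]
    action_log_probs = action_log_probs_list[k]
    value = values[:, k].view(-1, 1)

    obs, reward, done, infos = envs.step(action)
    # ... episode_rewards, masks, bad_masks and rollouts.insert as before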