tristandeleu / pytorch-maml-rl

Reinforcement Learning with Model-Agnostic Meta-Learning in Pytorch

Questions about multi-gradient steps #46

Closed: HyeongYeolRyu closed this issue 3 years ago

HyeongYeolRyu commented 4 years ago

Hi, thank you for providing this great implementation! I've learned a lot from this repo; it's easy to understand and fast. My question: for the 2D navigation task, I trained with num_steps=5 and tested it, but the results are quite different from those in the original paper. I edited test.py as follows:

# test.py
...
# grad0_returns, ..., grad4_returns are lists created earlier (omitted here)
for batch in trange(args.num_batches):
        tasks = sampler.sample_tasks(num_tasks=args.meta_batch_size)
        train_episodes, valid_episodes = sampler.sample(tasks,
                                                        num_steps=config['num-steps'], # num_steps=5
                                                        fast_lr=config['fast-lr'],
                                                        gamma=config['gamma'],
                                                        gae_lambda=config['gae-lambda'],
                                                        device=args.device)

        logs['tasks'].extend(tasks)
        # Collect the returns observed at each of the 5 adaptation steps
        grad0_returns.append(get_returns(train_episodes[0]))
        grad1_returns.append(get_returns(train_episodes[1]))
        grad2_returns.append(get_returns(train_episodes[2]))
        grad3_returns.append(get_returns(train_episodes[3]))
        grad4_returns.append(get_returns(train_episodes[4]))
...
logs['grad0_returns'] = np.concatenate(grad0_returns, axis=0)
logs['grad1_returns'] = np.concatenate(grad1_returns, axis=0)
...

I made this edit after seeing https://github.com/tristandeleu/pytorch-maml-rl/issues/26#issuecomment-573679714.

To see the results, I did something like this.

...
data = np.load('path-to-results')
grad0_returns = data['grad0_returns']
grad1_returns = data['grad1_returns']
...
val0 = grad0_returns.mean()
val1 = grad1_returns.mean()
...
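(For reference, here is a more compact way to aggregate all of the saved keys at once; this is just a sketch, assuming the results file was written with np.savez under the grad{i}_returns keys used above.)

import numpy as np

data = np.load('path-to-results')  # same results file as above
num_steps = 5

# One mean per saved key, in the same order the lists were filled above
for i in range(num_steps):
    returns = data[f'grad{i}_returns']
    print(f'grad{i}_returns: mean = {returns.mean():.2f}')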

However, as the figure shows, the resulting values are far from those in the paper.

[Figure_1: average returns after each gradient step from my run]

[figure2: corresponding figure from the paper]

I also tested with just 1 gradient step, which gives about -10, similar to the original paper. The problem only appears with more gradient steps.

One more thing: the paper says that for evaluation they used a fast learning rate of 0.1 for the first gradient step, then halved it to 0.05 for all subsequent steps. I can't find this in the implementation. Isn't it important? Since I missed this, I'm now struggling to modify the code to follow the original paper.

Thank you very much in advance!

tristandeleu commented 4 years ago

Thank you for the kind words!

It looks like in the MAML paper they only used a single step of gradient descent during training for all experiments in RL (so I guess including the one for 2D-Navigation). From Appendix A.2:

In all reinforcement learning experiments, the MAML policy was trained using a single gradient step with α = 0.1.

That seems to be consistent with what you found (a return of about -10 after 1 step). The results with multiple gradient steps must then apply only to evaluation, and you're right that there is a special evaluation scheme. What you found is very interesting, though: training directly with 5 steps of adaptation does not perform as well as training with a single step (there is some kind of "overfitting" happening where the performance decreases on the 4th gradient step).

As for using a different learning rate for each step, I don't think there's an easy way to do that with the current code, unfortunately. However, it shouldn't be too hard to modify the fast_lr argument to accept a list of learning rates. I can take a look at that!
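For illustration, the inner loop could look roughly like this once fast_lr is a list (just a sketch, not the current API; adapt_with_schedule and inner_loss_fn are made-up names, and the paper's evaluation schedule for 2D navigation would be [0.1, 0.05, 0.05, 0.05, 0.05]):

from collections import OrderedDict

import torch


def adapt_with_schedule(policy, inner_loss_fn, episodes_per_step, step_sizes):
    """Run one gradient step per entry of step_sizes.

    step_sizes: one learning rate per adaptation step,
        e.g. [0.1, 0.05, 0.05, 0.05, 0.05] for the paper's evaluation schedule.
    inner_loss_fn(episodes, params): hypothetical functional loss that
        evaluates the policy with an explicit parameter dict.
    """
    params = OrderedDict(policy.named_parameters())
    for episodes, lr in zip(episodes_per_step, step_sizes):
        loss = inner_loss_fn(episodes, params)
        # A first-order update is enough at evaluation time
        grads = torch.autograd.grad(loss, list(params.values()))
        params = OrderedDict(
            (name, param - lr * grad)
            for (name, param), grad in zip(params.items(), grads))
    return params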

HyeongYeolRyu commented 4 years ago

Thank you for the quick reply and for looking into it! Yes, I also considered the overfitting explanation. But in that case the return after 1 gradient step should still be around -10, not -25 as in my graph, since after a single gradient step the average return should be at least -10. This discrepancy makes me curious, so I'm running some experiments with modified code. If I find anything new, I'll let you know.

Thanks again!