Thank you for the kind words!
It looks like in the MAML paper they only used a single step of gradient descent during training for all experiments in RL (so I guess including the one for 2D-Navigation). From Appendix A.2:
> In all reinforcement learning experiments, the MAML policy was trained using a single gradient step with α = 0.1.
That seems to be consistent with what you find (a return of -10 after 1 step). The results with multiple gradient steps must then be limited to evaluation, and you're right, there is a special evaluation scheme. What you found is very interesting though: it seems that training directly with 5 steps of adaptation does not perform as well as using a single step (there is some kind of "overfitting" happening, where the performance decreases on the 4th gradient step).
Now for using different learning rates for each step, I don't think there's an easy way to do that with the current code, unfortunately. However, it shouldn't be too hard to modify the fast_lr argument to accept a list of learning rates. I can take a look at that!
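A rough sketch of what I have in mind (hypothetical, not the current code; get_step_sizes is a made-up helper name):

```python
def get_step_sizes(fast_lr, num_steps):
    # Hypothetical helper: accept either a single float (current
    # behaviour) or one learning rate per adaptation step.
    if isinstance(fast_lr, (list, tuple)):
        assert len(fast_lr) == num_steps, "need one learning rate per step"
        return list(fast_lr)
    return [fast_lr] * num_steps

# e.g. the schedule from the paper, for 5 adaptation steps:
print(get_step_sizes([0.1, 0.05, 0.05, 0.05, 0.05], 5))
print(get_step_sizes(0.1, 5))  # current behaviour: same lr every step
```

The inner loop would then pick step_sizes[step] at each adaptation step instead of reusing the single fast_lr everywhere.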
Thank you for your quick reply and consideration! Yeah, I also considered the overfitting problem. But in that case, the return after 1 gradient step would be around -10, not -25 as in my graph: even if later steps overfit, the first gradient step should still reach roughly -10. This discrepancy made me curious, so I ran some experiments with different code. If I find something new, I will let you know.
Thanks again!
Hi, thank you for providing these great implementations! I've learned a lot from this repo, which is easy to understand and fast. My question is that for the 2D-navigation task, I trained with num_steps=5 and tested it, but the results are quite different from those in the original paper. After seeing https://github.com/tristandeleu/pytorch-maml-rl/issues/26#issuecomment-573679714, I edited the test.py code accordingly.
To see the results, I did something like this.
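In short, the evaluation adapts for num_steps gradient steps and records the average return after each step. A toy version of the loop (with a simple distance-based stand-in for the sampled 2D-navigation return, not my actual test.py edits):

```python
import torch

# Toy stand-in: the "policy" is a 2-D point, the "return" is minus its
# distance to a goal, and each adaptation step is one gradient step.
goal = torch.tensor([0.8, -0.3])
params = torch.zeros(2, requires_grad=True)

fast_lr, num_steps = 0.1, 5
returns_after_step = []
for step in range(num_steps):
    loss = torch.dist(params, goal)            # negative "return"
    grad, = torch.autograd.grad(loss, params)
    params = (params - fast_lr * grad).detach().requires_grad_()
    returns_after_step.append(-torch.dist(params, goal).item())

print(returns_after_step)  # one average return per gradient step
```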
However, as my figure shows, the returns are far from those reported in the paper.
Figure from the paper.
I also tested with just 1 gradient step, which gives about -10, similar to the original paper. The problem only appears with more gradient steps.
And one more thing: the paper says that for evaluation they used a fast learning rate of 0.1 for the first gradient step, then halved it to 0.05 for all subsequent steps. But I can't find that in this implementation. Isn't this a critical detail? Since I didn't check this at first, I'm now struggling to modify the code to follow the original paper.
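Concretely, the schedule I mean is just this (a toy helper to illustrate, not code from this repo):

```python
def eval_step_size(step):
    # Evaluation schedule from the MAML paper: alpha = 0.1 for the
    # first gradient step, then 0.05 for every later step.
    return 0.1 if step == 0 else 0.05

step_sizes = [eval_step_size(k) for k in range(5)]
print(step_sizes)  # [0.1, 0.05, 0.05, 0.05, 0.05]
```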
Thank you very much in advance!