nicklashansen / tdmpc

Code for "Temporal Difference Learning for Model Predictive Control"
MIT License

Intuition behind using zero initialization in critic and reward model last layer #13

Closed · hdadong closed this 1 year ago

hdadong commented 1 year ago

Hi, while reading through the code implementation, I noticed that you use zero initialization in the last layer of the critic and reward model. What is the reason for this? I guess it can help avoid some overestimation, is that right? Another question: why did you choose orthogonal_init for every layer except the last? Is there any theory behind this? Thanks in advance!

nicklashansen commented 1 year ago

Hi, thank you for your interest in our work. To answer your questions:

(1) We zero-init the last layer of the critic and reward model to ensure that planning with an untrained network assigns equal value to all trajectories. This has a relatively small impact on downstream task performance, but is a neat property nonetheless.

(2) We use orthogonal init to be consistent with prior work, e.g. https://github.com/facebookresearch/drqv2. In my experience, this also has a relatively small (but not negligible) impact on performance.

I am closing the issue, but feel free to reopen if you have additional questions.
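For anyone landing here later, a minimal sketch of the initialization scheme described above. This is illustrative PyTorch, not the repo's exact helper code; the MLP shape and names (`build_mlp`, the ELU activations, the dimensions) are assumptions for the example. The key point is that orthogonal init is applied to all layers except the last, whose weights and bias are zeroed so an untrained head outputs 0 for every input:

```python
import torch
import torch.nn as nn

def build_mlp(in_dim, hidden_dim, out_dim):
    """Illustrative MLP: orthogonal init on hidden layers,
    zero init on the last layer (as in the critic/reward heads)."""
    net = nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.ELU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
        nn.Linear(hidden_dim, out_dim),
    )
    for m in net[:-1]:
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)  # orthogonal init, as in DrQ-v2
            nn.init.zeros_(m.bias)
    # Zero-init the last layer: an untrained critic/reward head then
    # outputs 0 for any input, so planning assigns equal value to all
    # candidate trajectories before training.
    nn.init.zeros_(net[-1].weight)
    nn.init.zeros_(net[-1].bias)
    return net

# Sanity check: an untrained reward head predicts 0 for any latent state.
reward = build_mlp(50, 256, 1)
z = torch.randn(8, 50)
assert torch.all(reward(z) == 0)
```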

Best,