nicklashansen / tdmpc

Code for "Temporal Difference Learning for Model Predictive Control"
MIT License

Intuition behind using zero initialization in critic and reward model last layer #13

Closed · hdadong closed this 1 year ago

hdadong commented 1 year ago

Hi, while reading through the code implementation, I noticed that you use zero initialization in the last layer of the critic and reward model. What is the reason for this? I guess it can help avoid some overestimation, is that right? Another question: why did you choose orthogonal_init for every layer except the last? Is there any theory behind this? Thanks in advance!

nicklashansen commented 1 year ago

Hi, thank you for your interest in our work. To answer your questions:

(1) We zero-init the last layer of the critic and reward model to ensure that planning with an untrained network assigns equal value to all trajectories. This has a relatively small impact on downstream task performance, but is a neat property nonetheless.

(2) We use orthogonal init to be consistent with prior work, e.g. https://github.com/facebookresearch/drqv2. In my experience, this also has a relatively small (but not negligible) impact on performance.

I am closing the issue, but feel free to reopen if you have additional questions.
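For anyone landing here later, a minimal sketch of the initialization scheme described above. This is illustrative PyTorch, not the repo's exact helper code; the MLP shape and names (`build_mlp`, the ELU activations, the dimensions) are assumptions for the example. The key point is that orthogonal init is applied to all layers except the last, whose weights and bias are zeroed so an untrained head outputs 0 for every input:

```python
import torch
import torch.nn as nn

def build_mlp(in_dim, hidden_dim, out_dim):
    """Illustrative MLP: orthogonal init on hidden layers,
    zero init on the last layer (as in the critic/reward heads)."""
    net = nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.ELU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
        nn.Linear(hidden_dim, out_dim),
    )
    for m in net[:-1]:
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)  # orthogonal init, as in DrQ-v2
            nn.init.zeros_(m.bias)
    # Zero-init the last layer: an untrained critic/reward head then
    # outputs 0 for any input, so planning assigns equal value to all
    # candidate trajectories before training.
    nn.init.zeros_(net[-1].weight)
    nn.init.zeros_(net[-1].bias)
    return net

# Sanity check: an untrained reward head predicts 0 for any latent state.
reward = build_mlp(50, 256, 1)
z = torch.randn(8, 50)
assert torch.all(reward(z) == 0)
```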

Best,