microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

MuP for RNNs #77

Open norikazu99 opened 2 months ago

norikazu99 commented 2 months ago

Hello. Your paper seems to cover linear layers, convolutions, and Transformers, but not RNNs. Was that just to reduce the number of experiments, or is there a more fundamental reason behind the choice? If it was just to reduce the number of experiments, how should h0 be handled? Would you recommend zeroing out h0, or does it need to be initialized using mup.init.normal?

Thank you.
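
For concreteness, the two options for h0 asked about above can be sketched in plain Python. This only illustrates the question and is not an answer from the mup authors: `make_h0`, `rnn_step`, and the `std` parameter are hypothetical names chosen for this sketch, and nothing here calls the mup library.

```python
import math
import random


def make_h0(width, mode="zeros", std=1.0, seed=0):
    """Return an initial hidden state of length `width` (hypothetical helper).

    mode="zeros":  h0 is all zeros, so there is no extra tensor to scale.
    mode="normal": h0 is sampled from N(0, std^2). How `std` should depend
                   on width under muP is exactly the open question here.
    """
    if mode == "zeros":
        return [0.0] * width
    if mode == "normal":
        rng = random.Random(seed)
        return [rng.gauss(0.0, std) for _ in range(width)]
    raise ValueError(f"unknown mode: {mode}")


def rnn_step(h, x, w_hh, w_xh):
    """One Elman-style step: h' = tanh(W_hh h + W_xh x)."""
    return [
        math.tanh(
            sum(w * hj for w, hj in zip(row_h, h))
            + sum(w * xj for w, xj in zip(row_x, x))
        )
        for row_h, row_x in zip(w_hh, w_xh)
    ]


# Tiny usage example: hidden width 3, input dimension 2.
w_hh = [[0.1, 0.0, -0.2], [0.0, 0.3, 0.0], [0.2, 0.0, 0.1]]
w_xh = [[0.5, -0.5], [0.25, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]

h_from_zeros = rnn_step(make_h0(3, "zeros"), x, w_hh, w_xh)
h_from_normal = rnn_step(make_h0(3, "normal", std=0.5), x, w_hh, w_xh)
```

With a zero h0, the first step depends only on the input projection, so the recurrent weights play no role until the second step. With a sampled h0, the scale `std` becomes one more quantity whose width dependence would have to be chosen.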