MuP for RNNs

Hello, your paper seems to cover linear layers, convolutions, and transformers, but not RNNs. Was that just to reduce the number of experiments, or is there a more fundamental reason behind this choice? If it was just to reduce the number of experiments, how should h0 be handled? Would you recommend zeroing out h0, or does it need to be initialized using mup.init.normal?

Thank you.