jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks
https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/
MIT License

Investigate the effect of Dropout / Stochastic Depth on Model training/interpretability #58

Open jbloomAus opened 1 year ago

jbloomAus commented 1 year ago

From Gato paper: "Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1."

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.
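The mechanism described in the Gato quote (skip each residual sub-layer with probability 0.1 during training) can be sketched as a per-call coin flip on the residual branch. A minimal sketch in plain Python, assuming a generic `sublayer` callable and scalar activations for illustration (names and the `p_skip` default are hypothetical; real implementations such as Huang et al.'s also rescale by the survival probability, omitted here):

```python
import random

def stochastic_depth_sublayer(x, sublayer, p_skip=0.1, training=True):
    """Apply a residual sub-layer, randomly skipping it during training.

    When skipped, the residual connection makes the block an identity map,
    so the network is effectively shallower for that forward pass.
    """
    if training and random.random() < p_skip:
        return x                      # sub-layer skipped: identity via residual path
    return x + sublayer(x)           # standard residual form (layer norm omitted)

# Toy sub-layer for illustration: scales its input by 0.5.
half = lambda v: 0.5 * v

# At evaluation time the sub-layer always runs.
out_eval = stochastic_depth_sublayer(1.0, half, training=False)   # 1.5

# With p_skip=1.0 the sub-layer is always dropped during training.
out_skip = stochastic_depth_sublayer(1.0, half, p_skip=1.0)       # 1.0
```

Applying this independently to each attention and MLP sub-layer reproduces the scheme the quote describes, where every forward pass samples a random sub-network of the full transformer.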

Stochastic depth seems plausibly super valuable to me based on intuition. I should read that paper at some point - Joseph