jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks
https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/
MIT License

Investigate the effect of Dropout / Stochastic Depth on Model training/interpretability #58

Open jbloomAus opened 1 year ago

jbloomAus commented 1 year ago

From Gato paper: "Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1."

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.
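The mechanism described in the Gato quote (skip each residual sub-layer with probability 0.1 during training) can be sketched as a per-call coin flip on the residual branch. A minimal sketch in plain Python, assuming a generic `sublayer` callable and scalar activations for illustration (names and the `p_skip` default are hypothetical; real implementations such as Huang et al.'s also rescale by the survival probability, omitted here):

```python
import random

def stochastic_depth_sublayer(x, sublayer, p_skip=0.1, training=True):
    """Apply a residual sub-layer, randomly skipping it during training.

    When skipped, the residual connection makes the block an identity map,
    so the network is effectively shallower for that forward pass.
    """
    if training and random.random() < p_skip:
        return x                      # sub-layer skipped: identity via residual path
    return x + sublayer(x)           # standard residual form (layer norm omitted)

# Toy sub-layer for illustration: scales its input by 0.5.
half = lambda v: 0.5 * v

# At evaluation time the sub-layer always runs.
out_eval = stochastic_depth_sublayer(1.0, half, training=False)   # 1.5

# With p_skip=1.0 the sub-layer is always dropped during training.
out_skip = stochastic_depth_sublayer(1.0, half, p_skip=1.0)       # 1.0
```

Applying this independently to each attention and MLP sub-layer reproduces the scheme the quote describes, where every forward pass samples a random sub-network of the full transformer.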

Stochastic depth seems plausibly super valuable to me based on intuition. I should read that paper at some point - Joseph