Conference :
Link : http://arxiv.org/abs/2102.06810
Authors' Affiliation : Facebook AI Research
TL;DR : A paper that investigates why SimSiam works: "How can SSL with only positive pairs avoid representational collapse?"
Summary :
1. Introduction
Minimizing differences between positive pairs encourages modeling invariances, while contrasting negative pairs is thought to be required to prevent representational collapse
However, recent methods such as BYOL and SimSiam have succeeded without negative pairs.
Why these methods do not suffer representational collapse has not yet been explained.
The paper analyzes the behavior of non-contrastive SSL training and the empirical effects of multiple hyperparameters, including (1) the Exponential Moving Average (EMA) or momentum encoder, (2) a higher relative learning rate ($\alpha_p$) of the predictor, and (3) the weight decay $\eta$.
"We explain all these empirical findings with an exceedingly simple theory based on analyzing the nonlinear learning dynamics of simple linear networks."
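To make these three factors concrete, here is a minimal PyTorch-style sketch of one non-contrastive training step (my own illustration, not code from the paper); the module shapes, hyperparameter values, and names (`encoder`, `predictor`, `target`) are placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real networks (assumed shapes, not from the paper).
encoder = nn.Linear(32, 16)       # online network W
predictor = nn.Linear(16, 16)     # predictor W_p
target = copy.deepcopy(encoder)   # target network W_a, an EMA copy of the online net
for p in target.parameters():
    p.requires_grad_(False)

base_lr, alpha_p, eta, ema_m = 0.05, 10.0, 1e-4, 0.996  # assumed values
opt = torch.optim.SGD([
    {"params": encoder.parameters(), "lr": base_lr},
    {"params": predictor.parameters(), "lr": alpha_p * base_lr},  # (2) higher predictor lr
], weight_decay=eta)                                              # (3) weight decay

def train_step(x1, x2):
    """One non-contrastive step on two augmented views of the same batch."""
    online = predictor(encoder(x1))
    with torch.no_grad():          # stop-gradient on the target branch
        tgt = target(x2)
    loss = F.mse_loss(online, tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # (1) EMA / momentum update of the target network
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), encoder.parameters()):
            p_t.mul_(ema_m).add_(p_o, alpha=1.0 - ema_m)
    return loss.item()

x = torch.randn(8, 32)
train_step(x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x))
```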
Essential parts of non-contrastive SSL
EMA
Predictor Optimality and Relative learning rate
Weight Decay
DirectPred
2. Two-layer linear model
Theorem 1 (Weight decay promotes balancing of the predictor and online networks.)
Theorem 2 (The stop-gradient signal is essential for success.)
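For reference, the setup these theorems refer to (as I recall it; the notation is mine and may differ slightly from the paper) is an online weight $W$, a predictor $W_p$, and a target weight $W_a$ kept as an EMA of $W$, trained on two augmented views $x_1, x_2$ of the same input with a stop-gradient on the target branch:

$$
\mathcal{L} = \tfrac{1}{2}\,\mathbb{E}\,\big\| W_p W x_1 - \mathrm{StopGrad}(W_a x_2) \big\|^2
$$

Theorem 2 then says that the StopGrad in this objective is what keeps the dynamics from collapsing.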
3. How multiple factors affect learning dynamics
4. Optimization-free Predictor $W_p$
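A minimal numpy sketch of the DirectPred idea as I understand it: instead of training $W_p$ by gradient descent, keep a moving-average estimate of the correlation matrix of the online representation and set $W_p$ directly from its eigen-decomposition, mapping eigenvalues through a square root. The exact scaling and regularization constants below (`rho`, `eps`) are assumptions, not the paper's values.

```python
import numpy as np

def directpred_update(F_hat, f_batch, rho=0.3, eps=0.1):
    """Update the correlation estimate and recompute the predictor W_p.

    F_hat:   running estimate of E[f f^T] for the online representation f
    f_batch: (batch, dim) online representations from the current step
    rho:     moving-average rate (assumed value)
    eps:     small boost for weak directions (assumed regularization)
    """
    F_batch = f_batch.T @ f_batch / len(f_batch)
    F_hat = (1.0 - rho) * F_hat + rho * F_batch   # moving-average correlation
    s, U = np.linalg.eigh(F_hat)                  # eigen-decompose (symmetric PSD)
    s = np.clip(s, 0.0, None)
    s_max = max(s.max(), 1e-12)
    p = np.sqrt(s / s_max) + eps                  # sqrt-of-eigenvalue mapping
    W_p = U @ np.diag(p) @ U.T                    # predictor set analytically
    return F_hat, W_p

# Toy usage with a random 16-d representation batch.
dim = 16
F_hat = np.zeros((dim, dim))
f = np.random.randn(64, dim)
F_hat, W_p = directpred_update(F_hat, f)
```

The point of the section, as I read it, is that a predictor set analytically like this avoids collapse and works about as well as one trained by gradient descent.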
It got boring, so I'll stop here...