Open long8v opened 11 months ago
paper
TL;DR
Details
Overview
Masked Visual Reconstruction in Language Semantic Space
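The vision encoder and text encoder are taken from CLIP; the reconstruction branch follows MAE.
$f_i^k$ : feature of the original image
$g_i^k$ : feature of the image patch reconstructed MAE-style
$\theta$ : projection in the vision encoder
The reconstruction loss is a KL divergence, with stop-gradient applied to $p_i^k$.
The final loss is a weighted sum of the loss terms; the weights are reportedly 2:1.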
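A minimal sketch of how the notes above could fit together, in plain Python so the arithmetic is visible. Several things here are assumptions, not taken from the notes: that $p_i^k$ and $q_i^k$ are softmax distributions over similarities between a patch feature and a set of text features, the `temperature` parameter, all function names, and which term receives the weight 2. The second loss term is not named in the notes, so `other_loss` is only a placeholder.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def patch_distribution(patch_feature, text_features, temperature=1.0):
    """Assumed form: softmax over patch-to-text similarities."""
    sims = [dot(patch_feature, t) for t in text_features]
    return softmax(sims, temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def reconstruction_loss(f_patches, g_patches, text_features, temperature=1.0):
    """Mean KL between target distributions (from f) and predictions (from g).

    In an autodiff framework the target p would be wrapped in a
    stop-gradient (e.g. `.detach()` in PyTorch); plain Python has no
    gradients, so that step is only noted here.
    """
    losses = []
    for f, g in zip(f_patches, g_patches):
        p = patch_distribution(f, text_features, temperature)  # target, no gradient
        q = patch_distribution(g, text_features, temperature)  # prediction
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)


def total_loss(other_loss, recon_loss, w_other=2.0, w_recon=1.0):
    """Weighted sum with a 2:1 ratio; the assignment of weights is assumed."""
    return w_other * other_loss + w_recon * recon_loss
```

When the reconstructed features match the original ones, `reconstruction_loss` evaluates to zero, and it grows as the two patch-to-text distributions diverge.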
Result
Ablation