long8v / PTIR

Paper Today I Read

[125] RILS: Masked Visual Reconstruction in Language Semantic Space #137

Open long8v opened 11 months ago


paper

TL;DR

Details

Overview


Masked Visual Reconstruction in Language Semantic Space

Uses CLIP's vision encoder and text encoder, with reconstruction done MAE-style.


The loss is a KL divergence. Stop gradient is applied to $p_i^k$.
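To make the KL term concrete, here is a minimal stdlib-only sketch. It is not the paper's implementation: the function names, the softmax over logits, and the averaging over masked patches are all assumptions for illustration. The notes only say that the loss is a KL divergence and that $p_i^k$ gets stop gradient. In an autodiff framework the target would be detached (for example `tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX); in plain Python it is simply a constant.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pk * (math.log(pk + eps) - math.log(qk + eps))
        for pk, qk in zip(p, q)
    )


def masked_kl_loss(target_logits, pred_logits, masked_indices):
    """Average KL(p_i || q_i) over masked patches i.

    target_logits[i] -> p_i: the target distribution (hypothetical name).
        It is treated as a constant, mirroring the stop gradient on p_i^k.
    pred_logits[i]   -> q_i: the predicted distribution (hypothetical name).
    """
    if not masked_indices:
        raise ValueError("masked_indices must not be empty")
    total = 0.0
    for i in masked_indices:
        p_i = softmax(target_logits[i])  # constant target (no gradient)
        q_i = softmax(pred_logits[i])    # prediction being trained
        total += kl_divergence(p_i, q_i)
    return total / len(masked_indices)
```

The loss is zero when the predicted distribution matches the target on every masked patch and grows as they diverge.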


The final loss is a weighted sum; the weights are reportedly set to 2:1.
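As a sketch of that weighted sum, assuming the two terms are an image-text contrastive loss and the masked reconstruction loss (the symbols $\lambda_1$, $\lambda_2$, $\mathcal{L}_{\text{contrast}}$, and $\mathcal{L}_{\text{recon}}$ are illustrative names, not taken from these notes):

$$
\mathcal{L} = \lambda_1 \, \mathcal{L}_{\text{contrast}} + \lambda_2 \, \mathcal{L}_{\text{recon}}
$$

The notes give the ratio as 2:1 but do not say which term receives the larger weight.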

Result


Ablation
