letme-hj commented 1 year ago

MAGVLT: based on non-autoregressive mask prediction.

- enables bidirectional context encoding and fast decoding through parallel token predictions with iterative refinement
- extended editing capabilities such as image and text infilling

In contrast to AR-based generative VL transformer (ARGVLT), the proposed MAGVLT is able to exploit bidirectional conditioning and fast generation through a small number of refinement steps and parallel token predictions.

(Comparison:) ARGVLT (autoregressive generative VL transformer). This also seems to be a term the authors coined, probably as a label for the general approach.

It is said to have had chronic problems:
- unidirectional attention
- slow decoding

Meanwhile, diffusion models have recently shown remarkable performance in text-guided image generation, but text generation has been difficult for diffusion models.

tasks to train the model

combination of
- image-to-text
- text-to-image
- joint image-and-text mask prediction
additional tasks (devised)
- step-unrolled mask prediction (inspired by SUNDAE)
- selective prediction on the mixture of two image-text pairs.

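Approach

"Unified Generative Model": make a single model able to generate both text and images (there is some research in this direction, but it is not a common approach).

A recent example of the same approach is Connecting representation and generation via masked vision-language transformer (which turned out to have been rejected), but its performance is reportedly mediocre and its tasks are more limited than in this paper.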

letme-hj commented 1 year ago

Method

Masked Image-Text Modeling

Image input -- VQ-GAN --> latent X (16 x 16)
Text input -- BPE --> tokenized Y
(X, Y) -- special tokens added -- bidirectional transformer (full attention) -->

Bidirectional VS AR transformer

bidirectional: full attention (no attention mask)
AR: causal attention (attention mask applied)
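
The difference between the two attention patterns can be sketched as boolean matrices. This is only an illustration: the function name `attention_mask` and the toy sequence length are assumptions, not something taken from the paper.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; True means query i may attend to key j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

# Bidirectional: every position sees every other position (no mask).
bidirectional = attention_mask(4, causal=False)
# AR: position i only sees positions 0..i (lower triangle).
autoregressive = attention_mask(4, causal=True)

for row in autoregressive:
    print("".join("x" if allowed else "." for allowed in row))
```

With full attention, a masked position can use context on both sides, which is what makes infilling possible.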

mask prediction losses

inference: iterative decoding

- All tokens are decoded in parallel, so this is much faster than autoregressive decoding.

Prediction is iterative, with the masking ratio gradually decreasing. (Need to revisit the mask ratio function; it is called a variable mask ratio.)
Because decoding is not autoregressive, the target length must be predicted. A length predictor is used (proposed in a previous paper).
- The special token placed between X and Y represents the text length.
- The loss is cross-entropy (??)
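
The iterative decoding loop described above might look roughly like the following sketch. Several things here are assumptions rather than details from the notes: the cosine shape of `mask_ratio`, the confidence-based choice of which tokens to re-mask, the `MASK` id, and `toy_model`, which stands in for the transformer. The `length` argument plays the role of the output of the length predictor.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token

def mask_ratio(step, total_steps):
    """Assumed cosine schedule: close to 1 at the first step, 0 after the last."""
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

def toy_model(tokens, rng):
    """Stand-in for the transformer: one (token, confidence) pair per position."""
    return [(rng.randrange(10), rng.random()) for _ in tokens]

def iterative_decode(length, total_steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length                    # start from a fully masked sequence
    for step in range(total_steps):
        preds = toy_model(tokens, rng)          # all positions predicted in parallel
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        for i in masked:                        # fill every masked position
            tokens[i] = preds[i][0]
        # Re-mask the least confident new predictions; the count shrinks each step.
        n_mask = min(int(mask_ratio(step, total_steps) * length), len(masked))
        for i in sorted(masked, key=lambda i: preds[i][1])[:n_mask]:
            tokens[i] = MASK
    return tokens

decoded = iterative_decode(length=16)
```

The number of forward passes is `total_steps`, independent of the sequence length, whereas AR decoding needs one pass per token.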

step-unrolled mask prediction (UnrollMask)

A variable mask ratio is used during training to reflect the intermediate refinement steps, but there is still a gap between the corruption applied to the target tokens at training time and the corruption that appears in partially predicted tokens at test time (a train-test gap). -> SUNDAE proposed a way to close this gap and improved a non-AR autoencoder for text generation. (How? By optimizing the model conditioned on a corrupted target sequence which is sampled through one-step generative unrolling during training.)
Here the idea is adapted to masked token prediction: the one-step predicted sequence is remasked, with a reduced mask ratio, and MAGVLT then predicts the remasked tokens.
Applied only to I2T and T2I, so that the cross-modal context stays uncorrupted.
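
A minimal sketch of how the UnrollMask training inputs could be built, under stated assumptions: `toy_predict` stands in for MAGVLT, the second mask ratio is simply half of the first (the notes only say it is reduced), and positions are chosen uniformly at random.

```python
import random

MASK = -1  # hypothetical id for the mask token

def mask_tokens(tokens, ratio, rng):
    """Mask a random `ratio` fraction of positions; return (sequence, positions)."""
    n = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in positions else t for i, t in enumerate(tokens)], positions

def toy_predict(masked, rng):
    """Stand-in for the model: fills every MASK with a sampled token."""
    return [rng.randrange(10) if t == MASK else t for t in masked]

def unroll_mask_inputs(target, ratio, rng):
    # Step 1: ordinary mask prediction on the ground-truth target.
    masked, _ = mask_tokens(target, ratio, rng)
    one_step = toy_predict(masked, rng)     # may contain errors, as at test time
    # Step 2: re-mask the one-step prediction at a lower ratio (assumed: half).
    # The model is then trained to recover the target at these positions.
    remasked, positions = mask_tokens(one_step, ratio * 0.5, rng)
    return masked, remasked, positions

rng = random.Random(0)
target = list(range(10))
masked, remasked, positions = unroll_mask_inputs(target, 0.6, rng)
```

The unmasked part of `remasked` can contain the model's own mistakes, which is the kind of corruption seen at test time but not with plain masking of ground-truth tokens.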

selective prediction on mixed context (MixSel)

A common failure in multimodal generative modeling: the output fails to reflect the cross-modal context and appears to rely only on the within-modal context. -> MixSel is a newly proposed task to mitigate this.
Two images are concatenated, or two texts are concatenated. On the text side this is done by adding a special token.
Said to be similar to classifier-free guidance: both aim to strengthen the influence of the condition, but MixSel does not require two forward passes at test time.
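
For comparison, classifier-free guidance at test time combines a conditional and an unconditional prediction, which is where the two forward passes come from. The sketch below shows only that baseline, not MixSel; `toy_model`, the `calls` counter, and the values are illustrative assumptions.

```python
calls = 0  # counts forward passes of the toy model

def toy_model(tokens, context):
    """Stand-in for a generator returning one logit per position."""
    global calls
    calls += 1
    bias = 1.0 if context is not None else 0.0
    return [0.5 * t + bias for t in tokens]

def cfg_logits(tokens, context, scale):
    cond = toy_model(tokens, context)    # pass 1: with the cross-modal condition
    uncond = toy_model(tokens, None)     # pass 2: condition dropped
    # Push the prediction away from the unconditional one, toward the conditional one.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

guided = cfg_logits([0, 2, 4], context="a caption", scale=3.0)
```

According to the notes, MixSel aims for a similar effect through training, so inference needs only a single pass.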

letme-hj commented 1 year ago

Model

Image Encoder

VQ-GAN

Text Encoder

BPE tokenizer (the one used in CLIP)
text sequence len: 64

letme-hj / dl-papers

[7] MAGVLT: Masked Generative Vision-and-Language Transformer #7
