Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models

Pan et al., Technical Report 2022

Conditioned diffusion models have demonstrated state-of-the-art text-to-image synthesis capacity. Recently, most works focus on synthesizing independent images; however, for real-world applications, it is common and necessary to generate a series of coherent images for storytelling. In this work, we mainly focus on story visualization and continuation tasks and propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images. Moreover, AR-LDM can generalize to new characters through adaptation. To our best knowledge, this is the first work successfully leveraging diffusion models for coherent visual story synthesis. Quantitative results show that AR-LDM achieves SoTA FID scores on PororoSV, FlintstonesSV, and the newly introduced challenging dataset VIST containing natural images. Large-scale human evaluations show that AR-LDM has superior performance in terms of quality, relevance, and consistency.

🔑 Key idea:

Auto-regressive generation: the sampled (generated) image of the previous timestep is fed back as input for the next one.
Multimodal conditioning vector built with CLIP and BLIP to modulate the latent autoencoder.
Fine-tuning from pretrained Stable Diffusion (SD pretrained on LAION-5B), with the initial encoder and terminal decoder frozen.
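
The auto-regressive idea above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `generate_story`, `Frame`, and the three callables (`encode_caption`, `encode_history`, `sample_image`) are hypothetical names, and the toy stubs at the bottom only stand in for CLIP, BLIP, and the diffusion sampler so that the control flow can be run.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]  # stand-in for a real embedding tensor


@dataclass
class Frame:
    caption: str
    image: str        # placeholder for a decoded image
    history_len: int  # number of earlier (caption, image) pairs used as context


def generate_story(
    captions: Sequence[str],
    encode_caption: Callable[[str], Embedding],       # hypothetical: CLIP-like text encoder
    encode_history: Callable[[str, str], Embedding],  # hypothetical: BLIP-like caption+image encoder
    sample_image: Callable[[List[Embedding]], str],   # hypothetical: conditioned diffusion sampler
    initial_frames: Sequence[Tuple[str, str]] = (),   # given (caption, image) pairs for continuation
) -> List[Frame]:
    """Generate frames one at a time, conditioning each on all earlier frames."""
    history = [encode_history(c, img) for c, img in initial_frames]
    frames: List[Frame] = []
    for caption in captions:
        # Condition on the current caption plus every earlier (caption, image) pair.
        condition = [encode_caption(caption)] + history
        image = sample_image(condition)  # must finish before the next frame can start
        frames.append(Frame(caption, image, len(history)))
        history.append(encode_history(caption, image))
    return frames


# Toy stubs so the loop can be executed without any model weights.
def toy_caption_encoder(caption: str) -> Embedding:
    return [float(len(caption))]


def toy_history_encoder(caption: str, image: str) -> Embedding:
    return [float(len(caption)), float(len(image))]


def toy_sampler(condition: List[Embedding]) -> str:
    return f"image<{len(condition)} cond>"


if __name__ == "__main__":
    story = generate_story(
        ["Pororo wakes up", "Pororo goes outside", "Pororo meets Crong"],
        toy_caption_encoder, toy_history_encoder, toy_sampler,
    )
    print([f.history_len for f in story])  # [0, 1, 2]
```

Passing a non-empty `initial_frames` corresponds to the continuation setting, where the first frame is given rather than generated; leaving it empty corresponds to visualization from captions alone.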

💪 Strength:

Demonstrates performance on both story visualization (SV, without an initial hint) and story continuation (SC, with an initial hint; outputs 4 images) tasks.
Uses a modified VisualStorytelling dataset (VIST, originally a narrative multi-caption generation task) and demonstrates performance on realistic story synthesis.

😵 Weakness:

Heavy computational cost: 2 days for 50 epochs on 8 GPUs (A100).
- SD sampling is already slow, and it gets slower here: the auto-regressive method must wait for the previous timestep's image to be sampled (generated) before starting the next one. SC and SV both require multiple images to be sampled auto-regressively, so generation is expected to be time-consuming.
- Cf. toy experiments (single text-to-image synthesis on SD), which required 2 days on 4 GPUs (Quadro 8000).
No novel evaluation settings, such as unseen episodes, are presented. The quantitative evaluation only measures visual quality (FID).
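
The cost figures above can be put side by side with some rough arithmetic. This is only a back-of-the-envelope sketch: the helper names are hypothetical, the training numbers come from this note, and the frame and step counts in the last part are illustrative assumptions rather than values reported in the paper.

```python
def gpu_hours(days: int, num_gpus: int) -> int:
    """Total GPU-hours for a run that occupies `num_gpus` GPUs for `days` days."""
    return days * 24 * num_gpus


def sequential_denoising_steps(frames_per_story: int, steps_per_frame: int) -> int:
    """Denoising steps that must run one after another for a single story.

    Frames cannot be sampled in parallel, because each frame is conditioned
    on the previously generated ones.
    """
    return frames_per_story * steps_per_frame


# Training budgets quoted in this note (different GPU models, so not directly comparable).
ar_ldm_training = gpu_hours(days=2, num_gpus=8)  # 384 GPU-hours on A100
toy_training = gpu_hours(days=2, num_gpus=4)     # 192 GPU-hours on Quadro 8000

# Assumed, illustrative sampling settings: 5 frames per story, 50 steps per frame.
story_steps = sequential_denoising_steps(frames_per_story=5, steps_per_frame=50)   # 250
single_steps = sequential_denoising_steps(frames_per_story=1, steps_per_frame=50)  # 50

print(ar_ldm_training, toy_training, story_steps // single_steps)
```

Under these assumptions, the sequential sampling latency of a story grows linearly with the number of frames, which is the "waiting" overhead described in the first sub-point.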

🤔 Confidence:

Medium

✏️ Memo:

I am thinking of a new task, Bidirectional Story Visualization, which combines VIST and SV: a storytelling-to-visualization task (image-to-text-to-image).
- It could tighten the otherwise loose illustrative correspondence between the images and the text.
