
2023 (DALLE3) Improving Image Generation with Better Captions #21


pomelyu commented 8 months ago

Introduction


The work shows that the prompt-following ability of text-to-image models can be substantially improved by training on highly descriptive, synthetically generated image captions. Note that DALLE3 includes other improvements over DALLE2 that are not revealed in this work.

Method

The authors found that previous text-to-image models tend to ignore details in the prompt because the training captions written by human authors are too simple. Human authors focus on describing the main subject of the image and omit background details or common-sense relationships portrayed in the image. Important details that are commonly omitted from these captions include:

  1. The presence of objects like sinks in a kitchen or stop signs along a sidewalk and descriptions of those objects.
  2. The position of objects in a scene and the number of those objects.
  3. Common sense details like the colors and sizes of objects in a scene.
  4. The text that is displayed in an image.

As a result, the authors trained a captioner to generate highly descriptive captions and form new text-image pairs (see the sketch after this list):

  1. Train the captioner on a human-labelled dataset.
  2. Build a small dataset whose captions describe only the main subject of each image, and fine-tune the captioner on it.
  3. Take the same small dataset from step 2, extend the captions with detailed descriptions of the whole image (background, colors, etc.), and fine-tune the captioner on these.
  4. Apply this captioner to the whole dataset.
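
A minimal sketch of this recaptioning pipeline, in Python. The `Captioner` class, its `fit`/`caption` methods, and the dataset format are hypothetical placeholders, not the paper's actual implementation (the paper uses a pretrained image encoder conditioning an autoregressive language model); the point is the staged schedule of steps 1-4 above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    image_path: str
    caption: str

class Captioner:
    """Hypothetical image captioner; stands in for the paper's
    image-encoder + autoregressive text decoder."""

    def fit(self, pairs: List[Example]) -> None:
        # placeholder for likelihood training on (image, caption) pairs
        pass

    def caption(self, image_path: str) -> str:
        # placeholder for sampling a caption conditioned on the image
        return f"a detailed description of {image_path}"


def build_captioner(
    human_labeled: List[Example],            # step 1: human-written captions
    short_subject_captions: List[Example],   # step 2: main-subject-only captions
    descriptive_captions: List[Example],     # step 3: whole-image descriptive captions
) -> Captioner:
    model = Captioner()
    model.fit(human_labeled)                 # 1. train on the human-labelled dataset
    model.fit(short_subject_captions)        # 2. fine-tune toward short, subject-only captions
    model.fit(descriptive_captions)          # 3. fine-tune toward highly descriptive captions
    return model


def recaption_dataset(model: Captioner, image_paths: List[str]) -> List[Example]:
    # 4. apply the final captioner to every image in the training set
    return [Example(p, model.caption(p)) for p in image_paths]
```

The descriptive captions produced in step 4 then replace (or supplement) the original human captions when training the text-to-image model.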

Highlight

Limitation

Comments