Introduction
The work shows that the prompt-following abilities of text-to-image models can be substantially improved by training on highly descriptive, synthetically generated image captions. Note that DALL-E 3 includes other improvements over DALL-E 2 that are not revealed in this work.
Method
The authors found that previous text-to-image models tend to ignore details in the prompts because the training captions, written by humans, are too simple. Human authors focus on briefly describing the main subject of the image and omit background details or common-sense relationships portrayed in the image. Important details that are commonly omitted from captions include:
The presence of objects like sinks in a kitchen or stop signs along a sidewalk and descriptions of those objects.
The position of objects in a scene and the number of those objects.
Common sense details like the colors and sizes of objects in a scene.
The text that is displayed in an image.
As a result, the authors trained a captioner to generate highly descriptive captions for the training images, in four steps:
1. Train the captioner on a human-labelled caption dataset.
2. Build a small dataset whose captions describe only the main subject, and fine-tune the captioner on it.
3. Take the same small dataset as in step 2, but add detailed descriptions of the whole image (background, colors, etc.), and fine-tune the captioner on it.
4. Apply this captioner to the whole training dataset.
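The four steps above can be sketched as a staged fine-tuning pipeline. The paper releases no code, so this is a toy model of the process: `Captioner` and `finetune` are hypothetical stand-ins that only record which dataset each stage trains on.

```python
# Toy sketch of the staged captioner training described above.
# All names (Captioner, finetune, dataset labels) are hypothetical stand-ins;
# the paper does not release an implementation.
from dataclasses import dataclass, field


@dataclass
class Captioner:
    """Stand-in for an image captioning model; records its training stages."""
    stages: list = field(default_factory=list)

    def finetune(self, dataset_name: str) -> "Captioner":
        # In the real pipeline this would run gradient updates on the dataset.
        self.stages.append(dataset_name)
        return self


# Step 1: train on a human-labelled caption dataset.
captioner = Captioner().finetune("human_labelled_captions")
# Step 2: fine-tune on a small dataset of subject-only captions.
captioner.finetune("small_subject_only_captions")
# Step 3: fine-tune on the same images, re-captioned with full-image detail.
captioner.finetune("small_highly_descriptive_captions")
# Step 4: the resulting captioner re-captions the whole training dataset.
print(captioner.stages)
```

The point of the sketch is the ordering: each fine-tuning stage starts from the previous stage's weights, so the final model inherits both the short-caption and descriptive-caption behavior.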
Highlight
At inference time, they use GPT-4 to "upsample" a simple prompt into a more detailed one, which yields a more detailed image. The upsampling prompt can be found in the appendix.
The model is also trained to render text in images more reliably, but the process is not revealed.
They use a benchmark called DrawBench, with GPT-4V as an automated judge, to compare the results with DALL-E 2 and SDXL.
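An automated judge of this kind produces a per-prompt verdict of which model's image better follows the prompt; aggregating the verdicts gives a win rate per model. A tiny sketch of that aggregation, with made-up verdicts:

```python
# Toy sketch of aggregating an automated judge's per-prompt verdicts into
# win rates. The verdicts below are made up for illustration.
from collections import Counter


def tally_win_rates(verdicts):
    """verdicts: iterable of model names, one judged winner per prompt."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}


# Hypothetical judge outputs over 5 prompts:
verdicts = ["DALL-E 3", "DALL-E 3", "SDXL", "DALL-E 3", "DALL-E 2"]
print(tally_win_rates(verdicts))  # DALL-E 3 wins 3/5 = 60% here
```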
Limitation
Comments