haoningwu3639 / StoryGen

[CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
https://haoningwu3639.github.io/StoryGen_Webpage/
MIT License

What do the inference results mean? #35

Open cc13qq opened 1 week ago

cc13qq commented 1 week ago

Thanks for your code and checkpoints. I tried to run inference.py, expecting to generate a story as a series of image-text pairs, but I got this: [screenshot: Screenshot 2024-09-03 172755]

I don't know what the results mean. Are they a series of images that compose a story? If so, where are the descriptions of these images?

cc13qq commented 1 week ago

The input prompt and image were:

prompt = "The white dog is singing"
prev_p = ["The white dog"]
ref_image = ["./data/image/00001.png"]

haoningwu3639 commented 1 week ago

This is a single inference. The 10 generated images correspond to 10 different random seeds (10 independent sampling runs), so they do not form a coherent story. Besides, directly using real images as the condition may introduce a domain gap with our story-style checkpoint, which can degrade generation quality.
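To illustrate the maintainer's point, here is a minimal sketch of that sampling pattern: one fixed prompt and reference image, sampled 10 times with different seeds, yielding 10 independent images rather than a story. `generate_frame` is a hypothetical stand-in for StoryGen's actual diffusion sampling call, not the repo's API.

```python
import random

def generate_frame(prompt: str, ref_image: str, seed: int) -> str:
    # Stub for illustration only: the real inference.py would run the
    # latent diffusion sampler with this seed and return an image.
    random.seed(seed)
    return f"image(prompt={prompt!r}, seed={seed}, noise={random.random():.3f})"

prompt = "The white dog is singing"
ref_image = "./data/image/00001.png"

# 10 independent samples with 10 seeds -> 10 unrelated images, not a story.
samples = [generate_frame(prompt, ref_image, seed) for seed in range(10)]
```

Each entry in `samples` differs only in its seed, which is why the grid of outputs looks like variations on one caption instead of a narrative sequence.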

cc13qq commented 1 week ago

I see. So how do you compose a coherent story out of several images and texts, like you did in the paper?

haoningwu3639 commented 1 week ago

You can start by generating the first frame of the story in single-frame mode (set stage = 'no'). Then use that generated frame as the ref_image to generate the next frame, iteratively producing a coherent and complete story in an autoregressive manner.
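The loop described above might be sketched as follows. This is a hedged illustration, not the repo's code: `generate_frame` is a stub for StoryGen's sampling call, the prompts are invented, and the stage value used after the first frame is a placeholder (check inference.py for the actual name).

```python
def generate_frame(prompt: str, ref_image, stage: str) -> str:
    # Stub for illustration: the real call runs StoryGen's sampler,
    # conditioning on ref_image when one is provided.
    return f"frame({prompt!r}, ref={ref_image}, stage={stage})"

story_prompts = [
    "The white dog wakes up",      # frame 1
    "The white dog is singing",    # frame 2
    "The white dog takes a bow",   # frame 3
]

frames = []
ref = None
for i, prompt in enumerate(story_prompts):
    # First frame: single-frame mode; later frames: placeholder stage name.
    stage = "no" if i == 0 else "multi-frame"
    ref = generate_frame(prompt, ref, stage)  # output conditions the next frame
    frames.append(ref)
```

The key design point is that each iteration feeds its output back in as `ref_image`, so every frame is visually conditioned on the previous one, which is what makes the sequence coherent.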