cc13qq opened this issue 1 week ago
The input prompt and image are:

```python
prompt = "The white dog is singing"
prev_p = ["The white dog"]
ref_image = ["./data/image/00001.png"]
```
This is a single inference. The 10 generated images correspond to 10 different random seeds (10 independent sampling runs), so they do not form a coherent story. Also, directly using real images as the condition may introduce a domain gap with our story-style checkpoint, which can degrade generation quality.
I see. So, how do you make a coherent story composed of several images and texts, just like what you did in the paper?
You can start by generating the first frame of the story in single-frame mode (set `stage = 'no'`). Then use the generated frame as the `ref_image` to generate the next frame, iteratively producing a coherent and complete story in an autoregressive manner.
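The loop described above can be sketched roughly as follows. This is a hypothetical outline, not the repo's actual API: `generate_frame` is a placeholder for whatever sampling call `inference.py` exposes, and the `stage` value used for later frames is an assumption — check the repo's options for the correct setting.

```python
def generate_frame(prompt, prev_p, ref_image, stage):
    # Placeholder for the real model call in inference.py; assume it
    # returns the path of the saved image. Replace with the actual API.
    return f"./output/{len(prev_p):05d}.png"

def generate_story(frame_prompts):
    prev_p, ref_image, frames = [], [], []
    for i, prompt in enumerate(frame_prompts):
        # First frame: single-frame mode (stage = 'no'), no reference image.
        # Later frames: conditioned on previous prompts and generated frames
        # (the exact stage value here is an assumption).
        stage = "no" if i == 0 else "auto-regressive"
        out = generate_frame(prompt, prev_p, ref_image, stage)
        frames.append(out)
        # Feed the generated frame back in as the condition for the next one.
        prev_p.append(prompt)
        ref_image.append(out)
    return frames

frames = generate_story([
    "The white dog wakes up",
    "The white dog is singing",
    "The white dog goes to sleep",
])
print(frames)
```

Each iteration conditions on all previously generated frames and their prompts, which is what keeps the characters consistent across the story.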
Thanks for your code and checkpoints. I tried to run `inference.py` and expected to generate a story as a series of image-text pairs. But I got this:
I don't know what the results mean. Are they a series of images that compose a story? If so, where are the descriptions of these images?