haoningwu3639 / StoryGen

[CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
https://haoningwu3639.github.io/StoryGen_Webpage/
MIT License
207 stars · 11 forks

inference setting #39

Open · Huni0318 opened this issue 1 month ago

Huni0318 commented 1 month ago

Hello, thank you for providing the code and checkpoints. I would like to generate stories similar to those in the paper, but the results I am getting fall short (see the screenshot below). Could you please share the exact inference settings? Here are my current settings.

pretrained_model_path = '/mnt/tmp/StoryGen/StoryGen/checkpoint_StorySalon'  # StorySalon checkpoint
logdir = "./inference_StorySalon/"  # output directory for generated frames
num_inference_steps = 40  # number of diffusion denoising steps
guidance_scale = 7  # guidance scale for the text prompt
image_guidance_scale = 3.5  # guidance scale for the contextual reference images
num_sample_per_prompt = 3  # images generated per prompt
mixed_precision = "fp16"
stage = 'auto-regressive'  # one of ["multi-image-condition", "auto-regressive", "no"]

prompt = "The white rabbit's favorite pastime was hopping through the meadow…"
prev_p = ["The white rabbit", "Once upon a time, in a tranquil meadow, there lived a fluffy white rabbit…"]
ref_image = ["/mnt/tmp/StoryGen/data/image/white_rabbit3.png", "/mnt/tmp/StoryGen/inference_StorySalon/_241004-072947/66763_output.png"]

test(pretrained_model_path, 
     logdir, 
     prompt, 
     prev_p, 
     ref_image, 
     num_inference_steps, 
     guidance_scale, 
     image_guidance_scale, 
     num_sample_per_prompt,
     stage, 
     mixed_precision)
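
Note that ref_image above already mixes an original image with a previously generated frame, which is the auto-regressive pattern: each generated image is fed back as a reference for the next prompt. A minimal sketch of that loop, assuming test() saves each frame as a .png under logdir and reusing the settings above; the helper latest_output() and the story_prompts list are illustrative, not part of the StoryGen codebase:

import glob
import os

story_prompts = [
    "Once upon a time, in a tranquil meadow, there lived a fluffy white rabbit…",
    "The white rabbit's favorite pastime was hopping through the meadow…",
]

def latest_output(logdir):
    # Pick the most recently written image under logdir (assumption:
    # test() saves each generated frame there as a .png file).
    files = glob.glob(os.path.join(logdir, "**", "*.png"), recursive=True)
    return max(files, key=os.path.getmtime)

prev_p, ref_image = [], []
for i, story_prompt in enumerate(story_prompts):
    stage = "no" if i == 0 else "auto-regressive"
    test(pretrained_model_path, logdir, story_prompt, prev_p, ref_image,
         num_inference_steps, guidance_scale, image_guidance_scale,
         num_sample_per_prompt, stage, mixed_precision)
    # Feed this frame's prompt and image back as context for the next frame.
    prev_p.append(story_prompt)
    ref_image.append(latest_output(logdir))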
Huni0318 commented 1 month ago
[Screenshot of generated results attached: 2024-10-08, 5:23 PM]

Even though I used the images from the paper, I have not been able to reproduce the results shown there.

Also, in stage 'no', the model still requires prev_p and a reference image (ref_image). How should I provide these? Isn't stage 'no' intended for generating the first frame?

haoningwu3639 commented 1 month ago

Please refer to our paper for the distinction between narrative text and descriptive text. Considering that descriptive text is more suitable as prompts for text-to-image models, we transformed the stories generated by GPT-4 into corresponding descriptive text, which was then fed into our model.
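
For illustration, here is a minimal sketch of how such a narrative-to-descriptive conversion might be scripted, assuming access to the OpenAI chat API; the system prompt below is illustrative and not the authors' exact instruction:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_descriptive(narrative: str) -> str:
    # Ask the model to rewrite a narrative sentence as a purely visual
    # scene description suitable as a text-to-image prompt.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the following story sentence as a short, "
                        "visual scene description suitable as a text-to-image "
                        "prompt. Describe only what is visible."},
            {"role": "user", "content": narrative},
        ],
    )
    return response.choices[0].message.content

prompt = to_descriptive("Once upon a time, in a tranquil meadow, "
                        "there lived a fluffy white rabbit.")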

Additionally, when generating the first frame, there is no need to provide prev_p or a reference image. You can either modify the code accordingly, or pass in any image, as it will not be used as a contextual condition in the generation process.
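
For example, a first-frame call under this advice might look like the sketch below, reusing the settings from the issue above. The placeholder values are illustrative; exact requirements depend on the inference code, but they are not used as contextual conditions in stage 'no':

stage = "no"  # first-frame generation: no contextual conditioning
prompt = "Once upon a time, in a tranquil meadow, there lived a fluffy white rabbit…"
prev_p = [""]  # placeholder; ignored as context in stage 'no'
ref_image = ["/mnt/tmp/StoryGen/data/image/white_rabbit3.png"]  # any image works; ignored as context

test(pretrained_model_path, logdir, prompt, prev_p, ref_image,
     num_inference_steps, guidance_scale, image_guidance_scale,
     num_sample_per_prompt, stage, mixed_precision)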