haoningwu3639 / StoryGen

[CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
https://haoningwu3639.github.io/StoryGen_Webpage/
MIT License

Results of inference #43

Open Rts-Wu opened 1 month ago

Rts-Wu commented 1 month ago

These are my settings:

```python
if __name__ == "__main__":
    pretrained_model_path = '/root/autodl-tmp/StoryGen/checkpoint_StorySalon/'
    logdir = "./inference_StorySalon/"
    num_inference_steps = 40
    guidance_scale = 7
    image_guidance_scale = 3.5
    num_sample_per_prompt = 10
    mixed_precision = "fp16"
    stage = 'multi-image-condition'  # ["multi-image-condition", "auto-regressive", "no"]
```

```python
prompt = "The cat is running after the mouse"
prev_p = ["The cat", "The mouse"]
ref_image = ["./Tom.png", "./Jerry.png"]
```
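Since a malformed reference-image path would fail silently during conditioning, a quick validation pass before a long inference run can help. This is a minimal sketch using the variable names above; `check_inputs` is a hypothetical helper, not part of the StoryGen repo:

```python
import os

def check_inputs(prev_p, ref_image):
    """Sanity-check the reference inputs before a long inference run (hypothetical helper)."""
    # Each previous prompt is expected to pair with one reference image.
    if len(prev_p) != len(ref_image):
        raise ValueError("prev_p and ref_image must have the same length")
    # Return any paths that do not point at an existing file (catches path typos).
    return [p for p in ref_image if not os.path.isfile(p)]

prev_p = ["The cat", "The mouse"]
ref_image = ["./Tom.png", "./Jerry.png"]

missing = check_inputs(prev_p, ref_image)
if missing:
    print("Missing reference images:", missing)
```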

But the results are a bit ridiculous: the generated images combine the cat and the mouse into a single new character, like these:

(image attached)

I also tried the checkpoint trained on COCO (using inference.py, changing only `pretrained_model_path = '/root/autodl-tmp/StoryGen/checkpoint_COCO/'`):

(image attached)

Could you give some advice?

Rts-Wu commented 1 month ago

Additionally, I have some other questions:

1. When I use auto-regressive mode with the image and prompt from the paper, the result is not good enough. In my understanding, since coherence across images is maintained through the reference image, it should be possible to generate each image from the current description alone. Why is the previous context still needed? What role does it play?

2. I observed that the character coherence in the first frame generated from the reference image is not as strong as described in the paper, but it is still acceptable. I tried the example from the paper, "White cat", where the prev_p used was "A White Cat".

(image attached)

However, the character coherence in the subsequent second frame seems to be lacking, possibly because of the first frame's context.

(images attached)

Can I keep prev_p as "A White Cat" for every frame? What drawbacks might this approach have?

3. During inference, there was a small warning: "The config attributes {'center_input_sample': False} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file." Could this explain the suboptimal results? I used the checkpoints you provided directly, but the results are still not good, and I have no idea what is wrong.
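That warning is usually harmless: diffusers simply ignores config keys that the installed `UNet2DConditionModel` no longer accepts, so it is unlikely to be the cause of the poor results. If you want to silence it, one option is to strip the stale key from the UNet's config.json. This is a hedged sketch, not part of the repo; the checkpoint subdirectory layout (`<checkpoint_dir>/unet/config.json`) is an assumption you should verify:

```python
import json
import os
import tempfile

def strip_stale_keys(config_path, stale=("center_input_sample",)):
    """Remove config keys that newer diffusers versions reject, editing the file in place."""
    with open(config_path) as f:
        cfg = json.load(f)
    removed = [k for k in stale if k in cfg]
    for k in removed:
        del cfg[k]
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return removed

# Demo on a throwaway config mirroring the offending key from the warning;
# in practice, point this at <checkpoint_dir>/unet/config.json (assumed layout).
demo = os.path.join(tempfile.mkdtemp(), "config.json")
with open(demo, "w") as f:
    json.dump({"center_input_sample": False, "sample_size": 64}, f)

print(strip_stale_keys(demo))  # → ['center_input_sample']
```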