Rts-Wu opened this issue 1 month ago
Additionally, I have some other questions:
1. When I use the auto-regressive mode with the image and prompt from the paper, the results are not good enough. In my understanding, since the coherence of the images is maintained through the ref-image, it should be possible to generate each image from the current description alone. Why is the previous context still needed, and what role does it play? (The first sketch after question 3 shows my mental model of this loop.)
2. I observed that the character coherence of the first frame generated from the ref-image is not as good as described in the paper, though it is still acceptable. I tried the example from the paper, "White cat," where the prev_p I used was "A White Cat."
However, character coherence in the subsequent second frame seems to be lacking, possibly because of the first frame's context.
Can I keep prev_p as "A White Cat"? What drawbacks might this approach have?
3. During inference, there was a small warning: "The config attributes {'center_input_sample': False} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file." Could this be a reason for the sub-optimal results? I used the checkpoints you provided directly, but the results are still not good, and I have no idea what is wrong. (The second sketch below shows how I would rule this warning out.)
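For reference, here is a minimal sketch of how I currently picture the auto-regressive mode. This is only my mental model, not the repository's code; `generate` and the prompts are placeholders I made up for illustration:

```python
from typing import List


def generate(prompt: str, condition_images: List[object]) -> object:
    """Placeholder for the StoryGen sampling call (hypothetical, not the repo API)."""
    ...


# Each new frame is conditioned on the current prompt plus the previously
# generated frames, so any drift in an early frame propagates forward.
ref_image = object()  # stands in for the loaded reference image
prompts = ["A white cat.", "The white cat sits by a window."]

context_frames = [ref_image]
for prompt in prompts:
    frame = generate(prompt=prompt, condition_images=context_frames)
    context_frames.append(frame)  # the new frame becomes context for the next step
```

If this picture is roughly right, I would expect the previous context mainly to stabilize style and layout, which is what question 1 is asking about.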
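And here is a minimal sketch of how I would check that the `center_input_sample` warning is harmless, assuming a standard diffusers `UNet2DConditionModel` layout under the checkpoint directory (the paths are my own, not code from the repository):

```python
import json
from pathlib import Path

from diffusers import UNet2DConditionModel

ckpt_dir = "/root/autodl-tmp/StoryGen/checkpoint_StorySalon/"

# Loading still works; diffusers simply drops config keys it no longer recognises,
# so the warning by itself should not change the generated images.
unet = UNet2DConditionModel.from_pretrained(ckpt_dir, subfolder="unet")

# Optionally strip the stale key from config.json to silence the warning.
cfg_path = Path(ckpt_dir) / "unet" / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg.pop("center_input_sample", None)
cfg_path.write_text(json.dumps(cfg, indent=2))
```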
These are my settings:

```python
if __name__ == "__main__":
    pretrained_model_path = '/root/autodl-tmp/StoryGen/checkpoint_StorySalon/'
    logdir = "./inference_StorySalon/"
    num_inference_steps = 40
    guidance_scale = 7
    image_guidance_scale = 3.5
    num_sample_per_prompt = 10
    mixed_precision = "fp16"
    stage = 'multi-image-condition'  # ["multi-image-condition", "auto-regressive", "no"]
```
But the results look a bit ridiculous: the generated images combine the cat and the mouse into a single new character, like these:
I also tried the COCO checkpoint (using inference.py, only changing pretrained_model_path = '/root/autodl-tmp/StoryGen/checkpoint_COCO/'):
Could you give some advice?