The output is black image.

showlab / DragAnything

[ECCV 2024] DragAnything: Motion Control for Anything using Entity Representation


The output is black image. #17

Closed T0L0ve closed 6 months ago

T0L0ve commented 7 months ago

I used the fp16 model of svd_xt, but the output image is black. The models are loaded like this:

controlnet = DragAnythingSDVModel.from_pretrained(args["DragAnything"], local_files_only=True, torch_dtype=torch.float16)

unet = UNetSpatioTemporalConditionControlNetModel.from_pretrained(args["pretrained_model_name_or_path"], subfolder="unet", torch_dtype=torch.float16, variant="fp16", local_files_only=True)

pipeline = StableVideoDiffusionPipeline.from_pretrained(args["pretrained_model_name_or_path"], controlnet=controlnet, unet=unet, local_files_only=True, torch_dtype=torch.float16, variant="fp16")

temp_0_20240329-172303 temp_1_20240329-172303

One of the outputs is a black GIF, and I get the error EOFError: no more images in GIF file.

weijiawu commented 7 months ago

This is likely due to incorrect frame output. Perhaps you could check if the output image shape is normal.

T0L0ve commented 7 months ago

The output image shape is (320, 576, 3). I didn't change the input image or the args in demo.py.

T0L0ve commented 7 months ago

When I use fp16, I get AttributeError: 'StableVideoDiffusionPipeline' object has no attribute 'dinov2'. Maybe something is wrong in this part?

if needs_upcasting:
    self.vae.to(dtype=torch.float16)
    self.dinov2.to(dtype=torch.float16)

weijiawu commented 7 months ago

dinov2

You can simply remove all the code related to dinov2. The current version does not use it.

weijiawu commented 7 months ago

The output image shape is (320, 576, 3). I didn't change the input image or the args in demo.py.

The output appears to be a single image, which is incorrect. The output should be a list of images.

T0L0ve commented 7 months ago

The output image shape is (320, 576, 3). I didn't change the input image or the args in demo.py.

The output appears to be a single image, which is incorrect. The output should be a list of images.

I only printed one image from the list. The shape of frames after decode_latents is torch.Size([1, 3, 20, 320, 576]):

frames = self.decode_latents(latents, num_frames, decode_chunk_size)
print(frames.shape)

I found that the image array is NaN.
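A check like the one described here can be sketched without extra dependencies, assuming the frame values have first been flattened to plain Python floats (for a tensor, something like `frames.float().flatten().tolist()`). `count_nonfinite` is a hypothetical helper name, not part of DragAnything or diffusers.

```python
import math


def count_nonfinite(values):
    """Count NaN and infinite entries in an iterable of floats."""
    nan_count = 0
    inf_count = 0
    for v in values:
        if math.isnan(v):
            nan_count += 1
        elif math.isinf(v):
            inf_count += 1
    return nan_count, inf_count


# Small made-up sample; real input would be the flattened decoded frames.
sample = [0.25, float("nan"), -1.0, float("inf"), float("nan")]
print(count_nonfinite(sample))  # (2, 1)
```

On the tensor itself, `torch.isnan(frames).any()` answers the same question directly and is much faster for a full video. Running the check after each stage (latents, then decoded frames) should help narrow down where the NaN values first appear.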

T0L0ve commented 7 months ago

I solved the problem by upgrading torch from 1.13 to 2.1.2.
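For anyone hitting the same symptom, a quick version check before running the fp16 demo might look like the sketch below. It assumes only what this thread reports: torch 1.13 produced NaN frames and 2.1.2 worked. Versions in between were not tested here, and `version_tuple`, `is_older_than_known_good` and `KNOWN_GOOD` are hypothetical names.

```python
import re

KNOWN_GOOD = (2, 1, 2)  # torch version reported to work in this thread


def version_tuple(version):
    """Parse the leading numeric part of a version string such as '2.1.2+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_older_than_known_good(version):
    """Return True if the version is older than the one reported to work."""
    return version_tuple(version) < KNOWN_GOOD


print(is_older_than_known_good("1.13.1+cu117"))  # True
print(is_older_than_known_good("2.1.2"))         # False
```

In a real environment the argument would be `torch.__version__`.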