huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[I2VGen-XL] The Official Inference Script of I2VGenXLPipeline Does Not Function Well #8429

Closed AlonzoLeeeooo closed 5 months ago

AlonzoLeeeooo commented 5 months ago

Describe the bug

Hi,

Thank you for your amazing work on diffusers, which has brought a great deal of convenience to researchers working on diffusion models and image synthesis.

I downloaded the diffusers model weights of I2VGen-XL and was trying to run its official inference script, which is included in the Reproduction section below.

The script failed with a RuntimeError from conv2d complaining about a 5D input of size [1, 3, 3, 224, 224]; the full traceback is in the Logs section below.

I have looked into the diffusers source code, and the problem seems to be caused by self.feature_extractor (line 328 in pipeline_i2vgen_xl.py): before line 328 the shape of image is [1, 3, 224, 224], and afterwards it becomes [1, 3, 3, 224, 224]. I have not explored this any further.
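For what it's worth, here is a small shape check I would use to narrow down where the extra dimension appears. This is only a sketch: the stock OpenAI CLIP processor checkpoint and the raw image URL below are stand-ins for illustration, not necessarily the exact components the pipeline loads (it bundles its own feature_extractor from ali-vilab/i2vgen-xl).

from transformers import CLIPImageProcessor
from diffusers.utils import load_image

# Stand-in CLIP image processor; the pipeline ships its own, but the expected
# output shape for a single RGB PIL image is the same: [1, 3, 224, 224].
feature_extractor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
image = load_image(
    "https://raw.githubusercontent.com/ali-vilab/VGen/main/data/test_images/img_0009.png"
).convert("RGB")

pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
# A 4D shape here is correct; a 5D shape like [1, 3, 3, 224, 224] would
# reproduce the conv2d error reported above.
print(pixel_values.shape)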

For the environment, I was using diffusers==0.28.2, so you should be able to reproduce this issue with the corresponding setup.

Please let me know if I have set anything up incorrectly.

Best regards

Reproduction

import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import load_image, export_to_gif

repo_id = "ali-vilab/i2vgen-xl" 
pipeline = I2VGenXLPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16").to("cuda")

image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?download=true"
image = load_image(image_url).convert("RGB")
prompt = "Papers were floating in the air on a table in the library"

generator = torch.manual_seed(8888)
frames = pipeline(
    prompt=prompt,
    image=image,
    generator=generator
).frames[0]

print(export_to_gif(frames))

Logs

Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████| 7/7 [00:37<00:00,  5.42s/it]
Traceback (most recent call last):
  File "/mnt/e/code/video-generation/video-object-replacement/anyv2v/i2vgenxl-inference.py", line 36, in <module>
    frames = pipeline(
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/diffusers/pipelines/i2vgen_xl/pipeline_i2vgen_xl.py", line 635, in __call__
    image_embeddings = self._encode_image(cropped_image, device, num_videos_per_prompt)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/diffusers/pipelines/i2vgen_xl/pipeline_i2vgen_xl.py", line 338, in _encode_image
    image_embeddings = self.image_encoder(image).image_embeds
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 1299, in forward
    vision_outputs = self.vision_model(
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 854, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 191, in forward
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/liuchang/anaconda3/envs/animatediff/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 3, 3, 224, 224]

System Info

Diffusers version: diffusers==0.28.2
OS environment: WSL2
GPU: NVIDIA RTX 3090

Who can help?

No response

tolgacangoz commented 5 months ago

Hi @AlonzoLeeeooo, when I run your reproduction code, this is thrown at image = load_image(image_url).convert("RGB"):

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x78e32c68a3e0>

In Colab, I couldn't reproduce the error you got. Could you double-check? The image seems to have been moved; this worked when I tried it:

image_url = "https://raw.githubusercontent.com/ali-vilab/VGen/main/data/test_images/img_0009.png"
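For reference, a quick way to confirm the URL itself is the problem (the github.com .../blob/... URL serves an HTML page rather than the PNG bytes, which is presumably why PIL raises UnidentifiedImageError) is a minimal check like this:

from diffusers.utils import load_image

# raw.githubusercontent.com serves the PNG directly, whereas the
# github.com "blob" URL serves an HTML page that PIL cannot decode.
image_url = "https://raw.githubusercontent.com/ali-vilab/VGen/main/data/test_images/img_0009.png"
image = load_image(image_url).convert("RGB")
print(image.size)  # prints the image dimensions if the download succeeded
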
tolgacangoz commented 5 months ago

You can also refer to the documentation.

AlonzoLeeeooo commented 5 months ago

Hi @tolgacangoz, thanks very much for your quick response and for the information. I will try your suggestions and report back.

Best

AlonzoLeeeooo commented 5 months ago

Thank you so much for your help @tolgacangoz. I have updated from transformers==4.25.2 to transformers==4.41.2, and the problem is now fixed.
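For anyone running into the same error, a quick version check like this might save some time; note that 4.41.2 is simply the version that worked for me, so the exact minimum working version is an open question:

import transformers
from packaging import version  # packaging ships as a dependency of transformers

# Older releases (e.g. transformers==4.25.2) produced the 5D pixel_values
# shape described above; upgrading resolved it for me.
if version.parse(transformers.__version__) < version.parse("4.41.2"):
    print(f"transformers {transformers.__version__} may be too old; consider `pip install -U transformers`")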