nateraw / stable-diffusion-videos

Create 🔥 videos with Stable Diffusion by exploring the latent space and morphing between text prompts

"RuntimeError: Input type (c10::Half) and bias type (float) should be the same" when running examples/make_music_video.py #150

Open philgzl opened 1 year ago

philgzl commented 1 year ago

When trying to generate a music video using examples/make_music_video.py locally, I get the following error:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /zhome/d6/0/134239/stable-diffusion-videos/examples/make_music_video.py:49   │
│ in <module>                                                                  │
│                                                                              │
│   46 │   1326004,                                                            │
│   47 │   5019608,                                                            │
│   48 ]                                                                       │
│ ❱ 49 pipe.walk(                                                              │
│   50 │   prompts=prompts,                                                    │
│   51 │   seeds=seeds,                                                        │
│   52 │   num_interpolation_steps=num_interpolation_steps,                    │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/stable_diffusion_videos/stable_di │
│ ffusion_pipeline.py:840 in walk                                              │
│                                                                              │
│   837 │   │   │   audio_offset = audio_start_sec + sum(num_interpolation_ste │
│   838 │   │   │   audio_duration = num_step / fps                            │
│   839 │   │   │                                                              │
│ ❱ 840 │   │   │   self.make_clip_frames(                                     │
│   841 │   │   │   │   prompt_a,                                              │
│   842 │   │   │   │   prompt_b,                                              │
│   843 │   │   │   │   seed_a,                                                │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/stable_diffusion_videos/stable_di │
│ ffusion_pipeline.py:624 in make_clip_frames                                  │
│                                                                              │
│   621 │   │                                                                  │
│   622 │   │   frame_index = skip                                             │
│   623 │   │   for _, embeds_batch, noise_batch in batch_generator:           │
│ ❱ 624 │   │   │   outputs = self(                                            │
│   625 │   │   │   │   latents=noise_batch,                                   │
│   626 │   │   │   │   text_embeddings=embeds_batch,                          │
│   627 │   │   │   │   height=height,                                         │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/venv/lib/python3.10/site-packages │
│ /torch/autograd/grad_mode.py:27 in decorate_context                          │
│                                                                              │
│    24 │   │   @functools.wraps(func)                                         │
│    25 │   │   def decorate_context(*args, **kwargs):                         │
│    26 │   │   │   with self.clone():                                         │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                           │
│    28 │   │   return cast(F, decorate_context)                               │
│    29 │                                                                      │
│    30 │   def _wrap_generator(self, func):                                   │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/stable_diffusion_videos/stable_di │
│ ffusion_pipeline.py:527 in __call__                                          │
│                                                                              │
│   524 │   │   │   │   callback(i, t, latents)                                │
│   525 │   │                                                                  │
│   526 │   │   latents = 1 / 0.18215 * latents                                │
│ ❱ 527 │   │   image = self.vae.decode(latents).sample                        │
│   528 │   │                                                                  │
│   529 │   │   image = (image / 2 + 0.5).clamp(0, 1)                          │
│   530                                                                        │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/venv/lib/python3.10/site-packages │
│ /diffusers/models/vae.py:605 in decode                                       │
│                                                                              │
│   602 │   │   │   decoded_slices = [self._decode(z_slice).sample for z_slice │
│   603 │   │   │   decoded = torch.cat(decoded_slices)                        │
│   604 │   │   else:                                                          │
│ ❱ 605 │   │   │   decoded = self._decode(z).sample                           │
│   606 │   │                                                                  │
│   607 │   │   if not return_dict:                                            │
│   608 │   │   │   return (decoded,)                                          │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/venv/lib/python3.10/site-packages │
│ /diffusers/models/vae.py:576 in _decode                                      │
│                                                                              │
│   573 │   │   return AutoencoderKLOutput(latent_dist=posterior)              │
│   574 │                                                                      │
│   575 │   def _decode(self, z: torch.FloatTensor, return_dict: bool = True)  │
│ ❱ 576 │   │   z = self.post_quant_conv(z)                                    │
│   577 │   │   dec = self.decoder(z)                                          │
│   578 │   │                                                                  │
│   579 │   │   if not return_dict:                                            │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/venv/lib/python3.10/site-packages │
│ /torch/nn/modules/module.py:1194 in _call_impl                               │
│                                                                              │
│   1191 │   │   # this function, and just call forward.                       │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._ │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                     │
│   1195 │   │   # Do not call functions when jit is used                      │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []         │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:            │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/venv/lib/python3.10/site-packages │
│ /torch/nn/modules/conv.py:463 in forward                                     │
│                                                                              │
│    460 │   │   │   │   │   │   self.padding, self.dilation, self.groups)     │
│    461 │                                                                     │
│    462 │   def forward(self, input: Tensor) -> Tensor:                       │
│ ❱  463 │   │   return self._conv_forward(input, self.weight, self.bias)      │
│    464                                                                       │
│    465 class Conv3d(_ConvNd):                                                │
│    466 │   __doc__ = r"""Applies a 3D convolution over an input signal compo │
│                                                                              │
│ /zhome/d6/0/134239/stable-diffusion-videos/venv/lib/python3.10/site-packages │
│ /torch/nn/modules/conv.py:459 in _conv_forward                               │
│                                                                              │
│    456 │   │   │   return F.conv2d(F.pad(input, self._reversed_padding_repea │
│    457 │   │   │   │   │   │   │   weight, bias, self.stride,                │
│    458 │   │   │   │   │   │   │   _pair(0), self.dilation, self.groups)     │
│ ❱  459 │   │   return F.conv2d(input, weight, bias, self.stride,             │
│    460 │   │   │   │   │   │   self.padding, self.dilation, self.groups)     │
│    461 │                                                                     │
│    462 │   def forward(self, input: Tensor) -> Tensor:                       │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

However, when I use the snippet in README.md, everything works fine.
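
For reference, the error boils down to half-precision latents being fed into a VAE whose convolution weights and bias are still float32. A minimal standalone PyTorch sketch (not from this repo) reproduces the same kind of mismatch:

    import torch
    from torch import nn

    conv = nn.Conv2d(4, 4, kernel_size=1)                     # parameters default to float32
    latents = torch.randn(1, 4, 64, 64, dtype=torch.float16)  # half-precision input
    conv(latents)  # raises a dtype-mismatch RuntimeError like the one above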

Atomic-Germ commented 1 year ago

Can you give a description of your environment? Python version, OS and version, etc.

philgzl commented 1 year ago

Python: 3.10.7
OS: Scientific Linux 7.9 (Nitrogen)

It seems the issue is caused by setting vae=AutoencoderKL.from_pretrained(f"stabilityai/sd-vae-ft-ema") when initializing the StableDiffusionWalkPipeline here. The snippet in README.md does not do this, and commenting this line out fixes the issue.
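
An alternative I have not tried would be to keep the ft-ema VAE but load it in half precision so its weights match the fp16 pipeline, roughly:

    import torch
    from diffusers import AutoencoderKL

    # assumption: load the ft-ema VAE in half precision instead of the float32 default
    vae = AutoencoderKL.from_pretrained(
        "stabilityai/sd-vae-ft-ema",
        torch_dtype=torch.float16,
    )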

Maybe the script should be updated to also use SD 2.1. Maybe I'll open a PR.

lingster commented 1 year ago

Alternatively, you could resolve this as follows:

    import torch
    from diffusers import AutoencoderKL

    revision = "fp16"
    model_path = "runwayml/stable-diffusion-v1-5"
    torch_dtype = torch.float16  # half precision, matching the fp16 pipeline weights

    # add this into your StableDiffusionWalkPipeline():
    vae=AutoencoderKL.from_pretrained(
        model_path,
        subfolder="vae",
        revision=revision,
        torch_dtype=torch_dtype,
    ),
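
For completeness, here is a rough sketch of how that keyword argument could slot into the full pipeline setup (the model path and dtype are only examples, adapt them to your script):

    import torch
    from diffusers import AutoencoderKL
    from stable_diffusion_videos import StableDiffusionWalkPipeline

    model_path = "runwayml/stable-diffusion-v1-5"
    revision = "fp16"
    torch_dtype = torch.float16

    pipe = StableDiffusionWalkPipeline.from_pretrained(
        model_path,
        vae=AutoencoderKL.from_pretrained(
            model_path,
            subfolder="vae",
            revision=revision,
            torch_dtype=torch_dtype,
        ),
        revision=revision,
        torch_dtype=torch_dtype,
    ).to("cuda")

Loading both the pipeline and the replacement VAE with the same torch_dtype keeps every submodule in half precision, which avoids the Half/float mismatch.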