kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: Tensor size mismatch when trying to generate video of different size #177

Open adhityaswami opened 1 year ago

adhityaswami commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

I tried generating a 384x216 (16:9 aspect ratio) video with my custom-trained, converted model. However, I get the following error:

DDIM sampling:   0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  ...
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 380, in forward
    x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

(The full traceback is reproduced in the console logs below.)

This occurs even when using the original model.
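
For anyone skimming: the failure is a plain skip-connection concat mismatch inside the UNet. A minimal sketch in bare PyTorch that raises the same error (the shapes and channel count here are illustrative, not the extension's actual ones):

```python
import torch

# Decoder feature after upsampling: latent height 27 -> 14 -> 7 -> 4, then 4 -> 8 on the way up
up = torch.randn(1, 320, 8, 48)
# The stored encoder skip feature at the matching level has height 7, not 8
skip = torch.randn(1, 320, 7, 48)

# Raises: RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 8 but got size 7 for tensor number 1 in the list.
x = torch.cat([up, skip], dim=1)
```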

Steps to reproduce the problem

  1. Go to the UI
  2. Try generating a video with width = 384 and height = 216

What should have happened?

A video of the requested dimensions should have been generated.

WebUI and Deforum extension Commit IDs

webui commit id - baf6946e06249c5af9851c60171692c44ef633e0
txt2vid commit id - a44078d1cc6a75f619037a63f3e26a483965b826

Torch version

2.0.1+cu118

What GPU were you using for launching?

NVIDIA A10G - 24GB

On which platform are you launching the webui backend with the extension?

Cloud server (Linux)

Settings

[screenshot of the text2video generation settings]

Console logs

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on ubuntu user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
python venv already activate: /home/ubuntu/text2vid/stable-diffusion-webui/venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc.so.4
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
Version: v1.3.2
Commit hash: baf6946e06249c5af9851c60171692c44ef633e0
Installing requirements

Launching Web UI with arguments: --listen
No module 'xformers'. Proceeding without it.
Loading weights [6ce0161689] from /home/ubuntu/text2vid/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Creating model from config: /home/ubuntu/text2vid/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 4.4s (import torch: 0.9s, import gradio: 0.9s, import ldm: 0.4s, other imports: 0.8s, load scripts: 0.5s, create ui: 0.6s, gradio launch: 0.1s).
DiffusionWrapper has 859.52 M params.
Applying optimization: Doggettx... done.
Textual inversion embeddings loaded(0):
Model loaded in 1.7s (load weights from disk: 0.2s, create model: 0.9s, apply weights to model: 0.2s, apply half(): 0.1s, move model to device: 0.2s).
text2video — The model selected is:  ModelScope
 text2video extension for auto1111 webui
Git commit: a44078d1
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                  | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Blonde woman walking in a forest, dense foliage, pink leaves', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 40, 'seed': 3586594887, 'scale': 17, 'width': 384, 'height': 216, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 0}
latents torch.Size([1, 4, 40, 27, 48]) tensor(-0.0010, device='cuda:0') tensor(0.9960, device='cuda:0')
DDIM sampling:   0%|                                                  | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last):                                    | 0/31 [00:00<?, ?it/s]
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 27, in run
    vids_pack = process_modelscope(args_dict)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 209, in process_modelscope
    samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 258, in infer
    x0 = self.diffusion.ddim_sample_loop(
  File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1485, in ddim_sample_loop
    xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
  File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1334, in ddim_sample
    _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1275, in p_mean_variance
    y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
  File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 380, in forward
    x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

Additional information

No response

B34STW4RS commented 1 year ago

I don't think this is a bug; this is how SD worked before. The problem is that the latent height comes out to an odd number, in this instance 27, which the UNet's repeated halving can't handle, so the skip connections no longer line up. Best to use the slider to choose a resolution close to what you need and either crop it or squeeze it. I'm not sure what was changed in SD to support odd sizes, or when exactly the change was implemented.

e.g., try to make a 720-wide video:

Working in txt2vid mode
  0%|                                                                  | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': '', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 2563507479, 'scale': 17, 'width': 720, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 0}
latents torch.Size([1, 4, 24, 32, 90]) tensor(-0.0008, device='cuda:0') tensor(0.9997, device='cuda:0')
DDIM sampling:   0%|                                                  | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last):                                    | 0/31 [00:00<?, ?it/s]
  File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\t2v_helpers\render.py", line 24, in run
    vids_pack = process_modelscope(args_dict)
  File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\process_modelscope.py", line 205, in process_modelscope
    samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
  File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\t2v_pipeline.py", line 253, in infer
    x0 = self.diffusion.ddim_sample_loop(
  File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1475, in ddim_sample_loop
    xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
  File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1324, in ddim_sample
    _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
  File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1265, in p_mean_variance
    y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
  File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 380, in forward
    x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.

The latent width is now 90, which fails the same way.
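
To make the arithmetic concrete, here is a small editor's sketch. It assumes, based on the two tracebacks above, that the VAE downscales pixels by 8 and the UNet halves height/width three times with ceiling rounding; both failing sizes then produce an odd intermediate that desynchronizes a skip connection:

```python
def check(pixels: int, n_down: int = 3):
    """Trace one pixel dimension through the (assumed) VAE /8 and three UNet halvings."""
    lat = pixels // 8                      # 216 -> 27, 720 -> 90, 256 -> 32
    sizes = [lat]
    for _ in range(n_down):
        sizes.append(-(-sizes[-1] // 2))   # ceil division, like a padded stride-2 conv
    # every pre-halving size must be even, or upsampling overshoots the stored skip
    ok = all(s % 2 == 0 for s in sizes[:-1])
    return sizes, ok

print(check(216))  # ([27, 14, 7, 4], False): 4 upsamples to 8, but the skip is 7
print(check(720))  # ([90, 45, 23, 12], False): 12 upsamples to 24, but the skip is 23
print(check(256))  # ([32, 16, 8, 4], True)
```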

adhityaswami commented 1 year ago

Hey, looks like you were right. It does work in vanilla SD, though, so I'll check out what changed there and try to implement it in the extension as well.

tl;dr for anyone facing this issue: keep your width and height divisible by 64. The latent is 1/8 of the pixel size, and it has to stay even through each of the UNet's three 2x downsamples (i.e., the latent must be divisible by 8), which is why both 216 and 720 fail.
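
If it helps, a tiny hypothetical helper (not part of the extension) that snaps a requested size to the nearest safe multiple:

```python
def snap(px: int, multiple: int = 64) -> int:
    """Round a pixel dimension to the nearest multiple (64 keeps the /8 latent divisible by 8)."""
    return max(multiple, round(px / multiple) * multiple)

print(snap(384))  # 384 (already fine)
print(snap(216))  # 192 (use 192 or 256 instead of 216)
print(snap(720))  # 704
```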