kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: vid2vid throws an exception 'need at least one array to stack' #72

Closed · james-s-tayler closed 1 year ago

james-s-tayler commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

I was trying to use vid2vid but kept getting an exception.

Steps to reproduce the problem

  1. Go to vid2vid and upload a 1-minute video downloaded from YouTube
  2. Input the prompt
  3. Keep all other settings at their defaults
  4. Click Generate

What should have happened?

It should have generated a video based on my prompt and the input video.

WebUI and Deforum extension Commit IDs

webui commit id: a9fed7c3
txt2vid commit id: 066a9e1

What GPU were you using for launching?

RTX 4090 24GB

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

Windows 11, python: 3.10.10  •  torch: 2.0.0+cu118  •  xformers: 0.0.17+b6be33a.d20230315  •  gradio: 3.16.2

- Steps: 30
- Frames: 30
- cfg_scale: 7
- width/height: 256
- seed: -1
- eta: 0
- denoising strength: 0.75
- vid2vid start frame: tried both 1 and 200, but same result
- batch count: 1
- VAE Mode: tried both GPU (half precision) and GPU, but same result

Console logs

```
ModelScope text2video extension for auto1111 webui
Git commit: 066a9e13 (Sun Mar 26 15:10:21 2023)
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
got a request to *vid2vid* an existing video.
Trying to extract frames from video with input FPS of 23.976023976023978. Please wait patiently.
Successfully extracted 2244.0 frames from video.
Loading frames: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "C:\source\stable-diffusion-webui_clean\extensions\sd-webui-modelscope-text2video\scripts\modelscope-text2vid.py", line 125, in process
    images=np.stack(images)# f h w c
  File "<__array_function__ internals>", line 180, in stack
  File "C:\source\stable-diffusion-webui_clean\venv\lib\site-packages\numpy\core\shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
Exception occurred: need at least one array to stack
```
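For what it's worth, the traceback says the `images` list passed to `np.stack` was empty: 2244 frames were extracted, but zero were loaded back (`Loading frames: 0it`). A minimal sketch reproducing the error, plus a hypothetical guard (the guard is an illustration, not the extension's actual code):

```python
import numpy as np

images = []  # zero frames loaded back, matching "Loading frames: 0it" above

# np.stack needs at least one array, so an empty list reproduces the error
try:
    np.stack(images)  # f h w c, as on line 125 of modelscope-text2vid.py
except ValueError as e:
    print(f"Exception occurred: {e}")  # -> need at least one array to stack

# Hypothetical guard (not the extension's actual code): fail early with a
# clearer message instead of letting np.stack raise.
if not images:
    raise RuntimeError("vid2vid loaded 0 frames; check the input video "
                       "and the vid2vid start frame setting")
frames = np.stack(images)
```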

Additional information

The video was https://www.youtube.com/watch?v=75rRs6fraUI&t=1s&ab_channel=VICENews.

It worked for a different video I used as input, but not this one. Maybe it's just allergic to BS?
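If it helps narrow this down, a quick way to test whether the video itself decodes is to count readable frames with OpenCV (a debugging sketch, not part of the extension; `input.mp4` is a hypothetical local path to the downloaded video):

```python
import cv2

video_path = "input.mp4"  # hypothetical path to the downloaded YouTube video

cap = cv2.VideoCapture(video_path)
reported = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # what the container claims

readable = 0
while True:
    ok, _frame = cap.read()
    if not ok:
        break
    readable += 1
cap.release()

# If readable is 0 while the container reports ~2244 frames, the video (or
# its codec) is the problem rather than the extension's stacking code.
print(f"container reports {reported} frames, decoded {readable}")
```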

github-actions[bot] commented 1 year ago

This issue has been closed due to incorrect formatting. Please address the following mistakes and reopen the issue: