kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: vid2vid throws an exception 'need at least one array to stack' #72

Closed · james-s-tayler closed 1 year ago

james-s-tayler commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

I was trying to use vid2vid but kept getting an exception.

Steps to reproduce the problem

  1. Go to vid2vid and upload a 1-minute video downloaded from YouTube
  2. Input the prompt
  3. Keep all other settings at their defaults
  4. Click Generate

What should have happened?

It should have generated a video based on my prompt and the input video.

WebUI and Deforum extension Commit IDs

webui commit id: a9fed7c3
txt2vid commit id: 066a9e1

What GPU were you using for launching?

RTX 4090 24GB

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

Windows 11, python: 3.10.10  •  torch: 2.0.0+cu118  •  xformers: 0.0.17+b6be33a.d20230315  •  gradio: 3.16.2

- Steps: 30
- Frames: 30
- cfg_scale: 7
- width/height: 256
- seed: -1
- eta: 0
- denoising strength: 0.75
- vid2vid start frame: tried both 1 and 200, but same result
- batch count: 1
- VAE Mode: tried both GPU (half precision) and GPU, but same result

Console logs

```
ModelScope text2video extension for auto1111 webui
Git commit: 066a9e13 (Sun Mar 26 15:10:21 2023)
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
got a request to *vid2vid* an existing video.
Trying to extract frames from video with input FPS of 23.976023976023978. Please wait patiently.
Successfully extracted 2244.0 frames from video.
Loading frames: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "C:\source\stable-diffusion-webui_clean\extensions\sd-webui-modelscope-text2video\scripts\modelscope-text2vid.py", line 125, in process
    images=np.stack(images)# f h w c
  File "<__array_function__ internals>", line 180, in stack
  File "C:\source\stable-diffusion-webui_clean\venv\lib\site-packages\numpy\core\shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
Exception occurred: need at least one array to stack
```
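For what it's worth, the traceback says the `images` list passed to `np.stack` was empty: 2244 frames were extracted, but zero were loaded back (`Loading frames: 0it`). A minimal sketch reproducing the error, plus a hypothetical guard (the guard is an illustration, not the extension's actual code):

```python
import numpy as np

images = []  # zero frames loaded back, matching "Loading frames: 0it" above

# np.stack needs at least one array, so an empty list reproduces the error
try:
    np.stack(images)  # f h w c, as on line 125 of modelscope-text2vid.py
except ValueError as e:
    print(f"Exception occurred: {e}")  # -> need at least one array to stack

# Hypothetical guard (not the extension's actual code): fail early with a
# clearer message instead of letting np.stack raise.
if not images:
    raise RuntimeError("vid2vid loaded 0 frames; check the input video "
                       "and the vid2vid start frame setting")
frames = np.stack(images)
```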

Additional information

The video was https://www.youtube.com/watch?v=75rRs6fraUI&t=1s&ab_channel=VICENews.

It worked for a different video I used as input, but not this one. Maybe it's just allergic to BS?
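If it helps narrow this down, a quick way to test whether the video itself decodes is to count readable frames with OpenCV (a debugging sketch, not part of the extension; `input.mp4` is a hypothetical local path to the downloaded video):

```python
import cv2

video_path = "input.mp4"  # hypothetical path to the downloaded YouTube video

cap = cv2.VideoCapture(video_path)
reported = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # what the container claims

readable = 0
while True:
    ok, _frame = cap.read()
    if not ok:
        break
    readable += 1
cap.release()

# If readable is 0 while the container reports ~2244 frames, the video (or
# its codec) is the problem rather than the extension's stacking code.
print(f"container reports {reported} frames, decoded {readable}")
```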

github-actions[bot] commented 1 year ago

This issue has been closed due to incorrect formatting. Please address the following mistakes and reopen the issue: