[Bug]: Anything beyond 45 frames fails with weird geometric shapes

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits of both this extension and the webui
Are you using the latest version of the extension?

[X] I have the modelscope text2video extension updated to the lastest version and I still have the issue.
What happened?

I tried to produce video from the Zeroscope 576w beyond 30 frames and I get garbage, especially around 60 frames.
Steps to reproduce the problem

Prompt: A monster made of food, opening it's mouth
Set resolution to 448x256
Render
Observe this type of output
https://github.com/kabachuha/sd-webui-text2video/assets/3253286/059a255f-0694-49e3-be13-846443bb483b
What should have happened?

I should have output like this only longer:
https://github.com/kabachuha/sd-webui-text2video/assets/3253286/10468a79-b20c-421c-97ee-b079b2b8bd74
WebUI and Deforum extension Commit IDs

webui commit id - [6ce0161689] txt2vid commit id - [3f4a109a]
Torch version

2.0.0+cu118
What GPU were you using for launching?

RTX 4090 - 24gb ram
On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)
Settings

Console logs

venv "C:\sdcurrent\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.4.1
Commit hash: f865d3e11647dfd6c7b2cdf90dde24680e58acd8
Installing requirements
Installing None
Installing onnxruntime-gpu...
Installing None
Installing opencv-python...
Installing None
Installing Pillow...

Checking roop requirements
Install insightface==0.7.3
Installing sd-webui-roop requirement: insightface==0.7.3
Install onnx==1.14.0
Installing sd-webui-roop requirement: onnx==1.14.0
Install onnxruntime==1.15.0
Installing sd-webui-roop requirement: onnxruntime==1.15.0
Install opencv-python==4.7.0.72
Installing sd-webui-roop requirement: opencv-python==4.7.0.72

If submitting an issue on github, please provide the full startup log for debugging purposes.

Initializing Dreambooth
Dreambooth revision: c2a5617c587b812b5a408143ddfb18fc49234edf
Successfully installed accelerate-0.19.0 fastapi-0.94.1 gitpython-3.1.32 transformers-4.30.2

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.

[+] xformers version 0.0.20 installed.
[+] torch version 2.0.0+cu118 installed.
[+] torchvision version 0.15.1+cu118 installed.
[+] accelerate version 0.19.0 installed.
[+] diffusers version 0.16.1 installed.
[+] transformers version 4.30.2 installed.
[+] bitsandbytes version 0.35.4 installed.

Launching Web UI with arguments: --no-gradio-queue --disable-safe-unpickle --opt-sdp-attention --opt-channelslast --xformers --autolaunch --ckpt-dir P:\Stable Diffusion Checkpoints
C:\sdcurrent\venv\lib\site-packages\pkg_resources\__init__.py:123: PkgResourcesDeprecationWarning: llow is an invalid version and will not be supported in a future release
  warnings.warn(
Civitai Helper: Get Custom Model Folder
Civitai Helper: Load setting from: C:\sdcurrent\extensions\Stable-Diffusion-Webui-Civitai-Helper\setting.json
Loading weights [f762cdef02] from P:\Stable Diffusion Checkpoints\ProtoGen_X5.3.ckpt
Creating model from config: C:\sdcurrent\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Couldn't find VAE named vae-ft-mse-840000-ema-pruned.safetensors; using None instead
Textual inversion embeddings loaded(0):
Textual inversion embeddings skipped(3): nartfixer, nfixer, nrealfixer
Model loaded in 44.1s (load weights from disk: 39.7s, create model: 0.3s, apply weights to model: 0.7s, apply channels_last: 0.6s, apply half(): 0.6s, move model to device: 0.7s, load textual inversion embeddings: 0.8s, calculate empty prompt: 0.7s).
[-] ADetailer initialized. version: 23.7.5, num models: 9
[AddNet] Updating model hashes...
0it [00:00, ?it/s]
[AddNet] Updating model hashes...
0it [00:00, ?it/s]
2023-07-13 06:35:08,038 - ControlNet - INFO - ControlNet v1.1.232
ControlNet preprocessor location: C:\sdcurrent\extensions\sd-webui-controlnet\annotator\downloads
2023-07-13 06:35:08,132 - ControlNet - INFO - ControlNet v1.1.232
2023-07-13 06:35:08,413 - roop - INFO - roop v0.0.2
2023-07-13 06:35:08,413 - roop - INFO - roop v0.0.2
Applying attention optimization: xformers... done.
load Sadtalker Checkpoints from p:\sdsadtalker\
[VRAMEstimator] Loaded benchmark data.
CUDA SETUP: Loading binary C:\sdcurrent\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cudaall.dll...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 59.6s (import torch: 3.2s, import gradio: 0.6s, import ldm: 0.3s, other imports: 0.9s, list SD models: 0.2s, load scripts: 50.6s, create ui: 2.1s, gradio launch: 1.5s).
preload_extensions_git_metadata for 61 extensions took 8.49s
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 3649264118, 'scale': 17, 'width': 256, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:07<00:00,  3.96it/s]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:07<00:00,  4.16it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 256])
output/mp4s/20230713_063553090779.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063527
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063527\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063527\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.13 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063527
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 3469174796, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:12<00:00,  2.36it/s]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:12<00:00,  2.43it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 448])
output/mp4s/20230713_063659116690.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063628
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063628\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063628\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.18 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063628
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 45, 'seed': 440291655, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.20it/s]
STARTING VAE ON GPU. 45 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.20it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([45, 3, 256, 448])
output/mp4s/20230713_063800238629.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063716
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063716\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063716\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063716
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 2121714807, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.85it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.87it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_063855476690.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063821
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063821\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063821\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.20 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063821
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 40, 'seed': 1678553386, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:21<00:00,  1.39it/s]
STARTING VAE ON GPU. 40 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:21<00:00,  1.39it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([40, 3, 256, 448])
output/mp4s/20230713_063952265486.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063912
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063912\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063912\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063912
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Captain Kirk and Khan eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 3748591507, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.18it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.18it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064101232872.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064017
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064017\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064017\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.24 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064017
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Captain Kirk and Khan eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 3076414624, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.17it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.17it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064216286792.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064132
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064132\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064132\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064132
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Captain Kirk and Khan eating celery during a phaser fight.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 3079263839, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.13s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064352248544.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064259
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064259\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064259\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.30 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064259
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth. Style of Kandinsky, renoir, monet, seurat.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 2569021106, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.11s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064615521114.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064522
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064522\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064522\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.32 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064522
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 108366175, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064738316647.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064645
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064645\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064645\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.28 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064645
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 262751843, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.17it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.16it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064843999468.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064759
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064759\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064759\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.29 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064759
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 3863835044, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.86it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.86it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_064938989121.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064905
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064905\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064905\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.23 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064905
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth. style of kandinsky, renoir, seurat, monet.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 785258030, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.87it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.87it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_065030732907.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064956
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064956\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064956\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 7.63 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064956
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth. style of kandinsky, renoir, seurat, monet.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 45, 'seed': 251782871, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.19it/s]
STARTING VAE ON GPU. 45 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.19it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([45, 3, 256, 448])
output/mp4s/20230713_065139745067.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713065055
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.28 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713065055
Interrupted with signal 2 in <frame at 0x00000271BFCF9D80, file 'c:\\python\\python310\\lib\\threading.py', line 324, code wait>
Additional information

No response
kabachuha / sd-webui-text2video

[Bug]: Anything beyond 45 frames fails with weird geometric shapes #206