Closed: Websteria closed this issue 1 year ago.
What happened?

Whenever I try to generate anything longer than 30-45 frames, the videos come out as geometric shapes only.

Steps to reproduce the problem

Example of a failed generation:
https://github.com/kabachuha/sd-webui-text2video/assets/3253286/05757ef4-0200-4497-929d-3ce690a644dc
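For reference, the pipeline config in the console logs below reports `'max_frames': 16` for this checkpoint, so frame counts far past 16 may simply exceed the clip length the model's temporal attention was trained on. Here is a minimal, hypothetical sketch for checking that outside the webui, using the diffusers port of the same ModelScope 1.7B weights — the model id, install, and frame sweep are assumptions, not the extension's own code path:

```python
# Hypothetical standalone check, NOT the extension's own code path.
# Assumes: pip install diffusers transformers accelerate, plus a CUDA GPU.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Diffusers port of the ModelScope text2video weights (an assumption; the
# extension loads its own .pth/.bin checkpoints directly).
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # optional on a 24 GB card, but harmless

prompt = "Elon Musk and George Bush eating celery."
# Sweep from the trained window (16 frames) past the point where my runs break.
for num_frames in (16, 24, 45, 60):
    result = pipe(prompt, num_inference_steps=30, num_frames=num_frames)
    # On recent diffusers releases frames is batched; older ones return frames directly.
    export_to_video(result.frames[0], f"repro_{num_frames}f.mp4")
```

If the 45- and 60-frame outputs degrade there too, the limit is in the model rather than in this extension.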
What should have happened?

The video should have generated like this example, only longer:
https://github.com/kabachuha/sd-webui-text2video/assets/3253286/9fd542e9-b912-4711-ab68-d460d955d073
WebUI and Deforum extension Commit IDs

webui commit id - [6ce0161689]
txt2vid commit id - [3f4a109a]
Torch version

2.0.0+cu118
What GPU were you using for launching?

RTX 4090 - 24GB
On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)
Console logs

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 3649264118, 'scale': 17, 'width': 256, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:07<00:00, 3.96it/s]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:07<00:00, 4.16it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 256])
output/mp4s/20230713_063553090779.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063527
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713063527\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713063527\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.13 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063527

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 3469174796, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:12<00:00, 2.36it/s]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:12<00:00, 2.43it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 448])
output/mp4s/20230713_063659116690.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063628
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713063628\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713063628\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.18 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063628

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 45, 'seed': 440291655, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.20it/s]
STARTING VAE ON GPU. 45 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.20it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([45, 3, 256, 448])
output/mp4s/20230713_063800238629.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063716
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713063716\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713063716\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063716

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 2121714807, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00, 1.85it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00, 1.87it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_063855476690.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063821
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713063821\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713063821\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.20 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063821

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 40, 'seed': 1678553386, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:21<00:00, 1.39it/s]
STARTING VAE ON GPU. 40 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:21<00:00, 1.39it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([40, 3, 256, 448])
output/mp4s/20230713_063952265486.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063912
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713063912\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713063912\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063912

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Captain Kirk and Khan eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 3748591507, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.18it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.18it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064101232872.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064017
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064017\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064017\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.24 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064017

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Captain Kirk and Khan eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 3076414624, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.17it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.17it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064216286792.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064132
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064132\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064132\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064132

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': 'Captain Kirk and Khan eating celery during a phaser fight.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 3079263839, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00, 1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00, 1.13s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064352248544.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064259
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064259\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064259\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.30 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064259

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': "A monster made of food, opening it's mouth. Style of Kandinsky, renoir, monet, seurat.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 2569021106, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00, 1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00, 1.11s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064615521114.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064522
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064522\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064522\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.32 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064522

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 108366175, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00, 1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00, 1.12s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064738316647.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064645
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064645\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064645\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.28 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064645

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 262751843, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.17it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.16it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064843999468.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064759
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064759\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064759\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.29 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064759

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 3863835044, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00, 1.86it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00, 1.86it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_064938989121.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064905
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064905\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064905\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.23 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064905

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': "A monster made of food, opening it's mouth. style of kandinsky, renoir, seurat, monet.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 785258030, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00, 1.87it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00, 1.87it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_065030732907.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064956
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713064956\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713064956\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 7.63 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064956

text2video — The model selected is: <modelscope> (ModelScope-like) text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
Making a video with the following parameters: {'prompt': "A monster made of food, opening it's mouth. style of kandinsky, renoir, seurat, monet.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 45, 'seed': 251782871, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.19it/s]
STARTING VAE ON GPU. 45 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00, 1.19it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([45, 3, 256, 448])
output/mp4s/20230713_065139745067.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713065055
Got a request to stitch frames to video using FFmpeg.
Frames: C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\%06d.png
To Video: C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.28 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713065055
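As an aside, the "Got a request to stitch frames to video using FFmpeg" step above is easy to redo by hand on the saved PNGs if a run's vid.mp4 ever needs regenerating. A rough sketch, assuming ffmpeg is on PATH; the frame pattern and directory are taken from the first run's log, while the frame rate and codec flags are illustrative guesses rather than the extension's exact arguments:

```python
# Re-stitch saved frames into an mp4, mirroring the extension's FFmpeg step.
import subprocess

# Output directory copied verbatim from the first run in the log above.
frames_dir = r"C:\sdcurrent\outputs/img2img-images\text2video\20230713063527"

subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "30",               # assumed FPS; the extension has its own setting
        "-i", rf"{frames_dir}\%06d.png",  # %06d matches the 000000.png naming in the log
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",            # broad player compatibility
        rf"{frames_dir}\vid.mp4",
    ],
    check=True,
)
```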
Additional information

No response
This issue has been closed due to incorrect formatting. Please fill in the required sections of the issue template and reopen the issue.