kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: Any video beyond 45 seconds is weird geometric shapes. #205

Closed. Websteria closed this issue 1 year ago.

Websteria commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

When I try to generate anything longer than 30-45 frames, the video comes out as nothing but weird geometric shapes.

Steps to reproduce the problem

  1. Go to the txt2video tab
  2. Change the frame count to more than 60
  3. The video turns out like this:

https://github.com/kabachuha/sd-webui-text2video/assets/3253286/05757ef4-0200-4497-929d-3ce690a644dc

What should have happened?

The video should have been generated like this, only longer:

https://github.com/kabachuha/sd-webui-text2video/assets/3253286/9fd542e9-b912-4711-ab68-d460d955d073

WebUI and Deforum extension Commit IDs

webui commit id - [6ce0161689]
txt2vid commit id - [3f4a109a]

Torch version

2.0.0+cu118
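
For reference, the reported build can be double-checked from a Python shell with standard PyTorch attributes; nothing here is specific to the extension:

```python
import torch

print(torch.__version__)             # expected: 2.0.0+cu118
print(torch.version.cuda)            # expected: 11.8
print(torch.cuda.is_available())     # should be True for the GPU setup below
print(torch.cuda.get_device_name(0)) # e.g. "NVIDIA GeForce RTX 4090"
```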

What GPU were you using for launching?

RTX 4090 - 24GB

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings


Console logs

text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 3649264118, 'scale': 17, 'width': 256, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:07<00:00,  3.96it/s]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:07<00:00,  4.16it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 256])
output/mp4s/20230713_063553090779.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063527
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063527\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063527\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.13 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063527
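
Note the 'max_frames': 16 entry in the model config printed above: the bundled ModelScope checkpoint is configured for 16-frame clips, which is plausibly why coherence degrades as the requested frame count grows well past it. As a minimal sketch (the function name and the check are hypothetical, not part of the extension), a pre-flight warning against that config value could look like this:

```python
# Hypothetical pre-flight check; `model_args` mirrors the 'model_args'
# dict printed in the config line of the log above.
def check_frame_count(requested_frames: int, model_args: dict) -> None:
    max_frames = model_args.get("max_frames", 16)
    if requested_frames > max_frames:
        print(f"Warning: requesting {requested_frames} frames, but the "
              f"checkpoint config lists max_frames={max_frames}; "
              "longer clips may lose temporal coherence.")

model_args = {"max_frames": 16, "tiny_gpu": 1}  # excerpt from the logged config
check_frame_count(60, model_args)
```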
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 3469174796, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:12<00:00,  2.36it/s]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:12<00:00,  2.43it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 448])
output/mp4s/20230713_063659116690.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063628
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063628\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063628\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.18 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063628
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 45, 'seed': 440291655, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.20it/s]
STARTING VAE ON GPU. 45 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.20it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([45, 3, 256, 448])
output/mp4s/20230713_063800238629.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063716
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063716\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063716\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063716
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 2121714807, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.85it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.87it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_063855476690.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063821
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063821\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063821\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.20 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063821
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Elon Musk and George Bush eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 40, 'seed': 1678553386, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:21<00:00,  1.39it/s]
STARTING VAE ON GPU. 40 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:21<00:00,  1.39it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([40, 3, 256, 448])
output/mp4s/20230713_063952265486.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713063912
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063912\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713063912\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713063912
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Captain Kirk and Khan eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 3748591507, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.18it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.18it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064101232872.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064017
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064017\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064017\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.24 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064017
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Captain Kirk and Khan eating celery.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 3076414624, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.17it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.17it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064216286792.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064132
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064132\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064132\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.25 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064132
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Captain Kirk and Khan eating celery during a phaser fight.', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 3079263839, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.13s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064352248544.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064259
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064259\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064259\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.30 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064259
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth. Style of Kandinsky, renoir, monet, seurat.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 2569021106, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.11s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064615521114.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064522
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064522\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064522\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.32 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064522
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 60, 'seed': 108366175, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
STARTING VAE ON GPU. 60 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:33<00:00,  1.12s/it]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([60, 3, 256, 448])
output/mp4s/20230713_064738316647.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064645
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064645\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064645\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.28 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064645
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 46, 'seed': 262751843, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.17it/s]
STARTING VAE ON GPU. 46 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.16it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([46, 3, 256, 448])
output/mp4s/20230713_064843999468.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064759
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064759\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064759\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.29 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064759
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 3863835044, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.86it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.86it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_064938989121.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064905
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064905\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064905\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.23 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064905
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth. style of kandinsky, renoir, seurat, monet.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 30, 'seed': 785258030, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.87it/s]
STARTING VAE ON GPU. 30 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:16<00:00,  1.87it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([30, 3, 256, 448])
output/mp4s/20230713_065030732907.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713064956
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064956\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713064956\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 7.63 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713064956
text2video — The model selected is: <modelscope> (ModelScope-like)
 text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': "A monster made of food, opening it's mouth. style of kandinsky, renoir, seurat, monet.", 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 45, 'seed': 251782871, 'scale': 17, 'width': 448, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM_Gaussian'}
Sampling random noise.
Sampling using DDIM_Gaussian for 30 steps.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.19it/s]
STARTING VAE ON GPU. 45 CHUNKS TO PROCESS.: 100%|██████████████████████████████████████| 30/30 [00:25<00:00,  1.19it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([45, 3, 256, 448])
output/mp4s/20230713_065139745067.mp4
text2video finished, saving frames to C:\sdcurrent\outputs/img2img-images\text2video\20230713065055
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\%06d.png
To Video:
C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.28 seconds!
t2v complete, result saved at C:\sdcurrent\outputs/img2img-images\text2video\20230713065055
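
For context on the final step of each run: the "stitch frames to video" log lines show only the FFmpeg input pattern and output path, not the exact flags the extension passes. A minimal equivalent invocation, with the frame rate and codec as assumptions since they do not appear in the log, would be:

```python
import subprocess

# Paths taken verbatim from the 'Frames:' and 'To Video:' lines of the last run.
frames = r"C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\%06d.png"
out = r"C:\sdcurrent\outputs/img2img-images\text2video\20230713065055\vid.mp4"

subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "24",     # assumption: the extension's FPS setting is not shown in the log
    "-i", frames,           # numbered PNG sequence
    "-c:v", "libx264",      # assumption: the codec is not shown in the log
    "-pix_fmt", "yuv420p",  # widest player compatibility
    out,
], check=True)
```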

Additional information

No response

github-actions[bot] commented 1 year ago

This issue has been closed due to incorrect formatting. Please address the following mistakes and reopen the issue: