kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: stuck while generating #234

Open operationairstrike opened 9 months ago

operationairstrike commented 9 months ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

Generation gets stuck at 0%. The VideoCrafter model loads, the tokenizer and model.safetensors downloads finish at 100%, and then the progress bar "Sampling Batches (text-to-video): 0%| | 0/1 [00:00<?, ?it/s]" never advances. The complete console output is under "Console logs" below.

Steps to reproduce the problem

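  1. Install the extension.
  2. Download the VideoCrafter model from Google Drive and put it into the folder the extension asks for (models/VideoCrafter/model.ckpt under the webui root, per the log).
  3. Run a generation from the web UI.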
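
Step 2 can be sanity-checked before launching the web UI. This is a minimal sketch, assuming the checkpoint location printed in the log (models/VideoCrafter/model.ckpt under the webui root). `check_checkpoint` is a hypothetical helper written for illustration, not part of the extension. In this report the model did load, so the check only rules out a missing or empty file.

```python
from pathlib import Path


def check_checkpoint(webui_root: str,
                     rel_path: str = "models/VideoCrafter/model.ckpt") -> str:
    """Return a short status string for the expected checkpoint file.

    rel_path is an assumption taken from the "Loading model from ..." log line.
    """
    ckpt = Path(webui_root) / rel_path
    if not ckpt.is_file():
        return f"missing: {ckpt}"
    size = ckpt.stat().st_size
    if size == 0:
        return f"empty: {ckpt}"
    return f"ok: {ckpt} ({size} bytes)"


if __name__ == "__main__":
    # Webui root as it appears in the log; adjust for your own install.
    print(check_checkpoint(r"C:\Users\user\Desktop\stable-diffusion-webui"))
```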

What should have happened?

No response

WebUI and Deforum extension Commit IDs

webui commit id - 5ef669de080814067961f28357256e8fe27544f4
txt2vid commit id - 01e41fd4

Torch version

2.0.1+cu118

What GPU were you using for launching?

NVIDIA RTX 4060 Mobile
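
The Torch version and GPU fields can be collected with a small script instead of being typed by hand. This is a rough sketch: `environment_report` is a hypothetical helper, and it should be run with the same Python environment the webui uses, otherwise it may report a different Torch install. It only reads version information and does not touch the extension.

```python
import importlib.util
import platform


def environment_report() -> dict:
    """Collect basic version info; still works when torch is not installed."""
    report = {
        "python": platform.python_version(),
        "platform": platform.system(),
    }
    if importlib.util.find_spec("torch") is None:
        report["torch"] = None  # torch is not importable in this environment
        return report

    import torch

    report["torch"] = torch.__version__          # e.g. a "+cu118" build string
    report["cuda_build"] = torch.version.cuda    # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back `False` on a machine with an NVIDIA GPU, sampling would run on the CPU, which could look like a hang. That is only one possible explanation and is not confirmed by this report.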

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

[Screenshot 2023-10-21 225611]

Console logs

text2video — The model selected is: <videocrafter> (VideoCrafter (WIP)-like)
 text2video extension for auto1111 webui
Git commit: 01e41fd4
VideoCrafter config:
 {'model': {'target': 'lvdm.models.ddpm3d.LatentDiffusion', 'params': {'linear_start': 0.00085, 'linear_end': 0.012, 'num_timesteps_cond': 1, 'log_every_t': 200, 'timesteps': 1000, 'first_stage_key': 'video', 'cond_stage_key': 'caption', 'image_size': [32, 32], 'video_length': 16, 'channels': 4, 'cond_stage_trainable': False, 'conditioning_key': 'crossattn', 'scale_by_std': False, 'scale_factor': 0.18215, 'unet_config': {'target': 'lvdm.models.modules.openaimodel3d.UNetModel', 'params': {'image_size': 32, 'in_channels': 4, 'out_channels': 4, 'model_channels': 320, 'attention_resolutions': [4, 2, 1], 'num_res_blocks': 2, 'channel_mult': [1, 2, 4, 4], 'num_heads': 8, 'transformer_depth': 1, 'context_dim': 768, 'use_checkpoint': True, 'legacy': False, 'kernel_size_t': 1, 'padding_t': 0, 'temporal_length': 16, 'use_relative_position': True}}, 'first_stage_config': {'target': 'lvdm.models.autoencoder.AutoencoderKL', 'params': {'embed_dim': 4, 'monitor': 'val/rec_loss', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}, 'lossconfig': {'target': 'torch.nn.Identity'}}}, 'cond_stage_config': {'target': 'lvdm.models.modules.condition_modules.FrozenCLIPEmbedder'}}}}
Loading model from C:\Users\user\Desktop\stable-diffusion-webui\models/VideoCrafter/model.ckpt
LatentDiffusion: Running in eps-prediction mode
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Successfully initialize the diffusion model !
DiffusionWrapper has 958.92 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Downloading (…)olve/main/vocab.json: 100%|██████████████████████████████████████████| 961k/961k [00:00<00:00, 5.75MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████████████████████████████████████| 525k/525k [00:00<00:00, 26.6MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████| 389/389 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████| 905/905 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████| 4.52k/4.52k [00:00<00:00, 4.45MB/s]
Downloading model.safetensors: 100%|██████████████████████████████████████████████| 1.71G/1.71G [02:12<00:00, 12.9MB/s]
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]
Sampling Batches (text-to-video):   0%|                                                          | 0/1 [00:00<?, ?it/s]
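
Most of the log above is repeated attention set-up lines, which makes the last meaningful line easy to miss. The sketch below collapses those lines into counts and reports the final progress-bar state. `summarize_log` and both regular expressions are illustrative assumptions about this log's format, not part of the extension or of tqdm.

```python
import re
from collections import Counter

# Matches "Query dim is 320, context_dim is None" style set-up lines.
ATTN_RE = re.compile(r"Query dim is (\d+), context_dim is (None|\d+)")
# Matches tqdm-style bars such as "0%|      | 0/1 [00:00<?, ?it/s]".
PROGRESS_RE = re.compile(r"(\d+)%\|.*?\|\s*(\d+)/(\d+)")


def summarize_log(text: str) -> dict:
    """Count attention set-up lines and keep the last sampling progress state."""
    attention = Counter()
    last_progress = None
    for line in text.splitlines():
        match = ATTN_RE.search(line)
        if match:
            attention[(int(match.group(1)), match.group(2))] += 1
            continue
        if line.lstrip().startswith("Downloading"):
            continue  # download bars are not sampling progress
        progress = PROGRESS_RE.search(line)
        if progress:
            last_progress = (int(progress.group(2)), int(progress.group(3)))
    return {"attention_layers": dict(attention), "last_progress": last_progress}


if __name__ == "__main__":
    sample = (
        "Setting up MemoryEfficientCrossAttention. Query dim is 320, "
        "context_dim is None and using 8 heads.\n"
        "Sampling Batches (text-to-video):   0%|      | 0/1 [00:00<?, ?it/s]\n"
    )
    print(summarize_log(sample))
```

A `last_progress` of `(0, 1)` with no traceback after it matches what this issue describes: the first sampling batch started and the log shows no further progress.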

Additional information

No response