kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: TypeError: TextToVideoSynthesis.infer() got an unexpected keyword argument 'mask' #102

Closed highjohnconquer closed 1 year ago

highjohnconquer commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

Every time I try to run text-to-video, I get this error:

TypeError: TextToVideoSynthesis.infer() got an unexpected keyword argument 'mask'
Exception occurred: TextToVideoSynthesis.infer() got an unexpected keyword argument 'mask'

Steps to reproduce the problem

  1. open extension
  2. type prompt
  3. run extension
  4. error

What should have happened?

video should have been generated

WebUI and Deforum extension Commit IDs

webui commit id - commit: 22bcc7be
txt2vid commit id - 1e32b561786f8763a2f87fcd122bc18804e4a7ab

What GPU were you using for launching?

NVIDIA GeForce RTX 3090

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

image

Console logs


Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 13.4s (load scripts: 5.6s, reload script modules: 0.1s, create ui: 7.4s, gradio launch: 0.2s).
text2video — The model selected is:  VideoCrafter
 text2video extension for auto1111 webui
Git commit: 1e32b561 (Sat Apr  8 00:39:15 2023)
VideoCrafter config:
 {'model': {'target': 'lvdm.models.ddpm3d.LatentDiffusion', 'params': {'linear_start': 0.00085, 'linear_end': 0.012, 'num_timesteps_cond': 1, 'log_every_t': 200, 'timesteps': 1000, 'first_stage_key': 'video', 'cond_stage_key': 'caption', 'image_size': [32, 32], 'video_length': 16, 'channels': 4, 'cond_stage_trainable': False, 'conditioning_key': 'crossattn', 'scale_by_std': False, 'scale_factor': 0.18215, 'unet_config': {'target': 'lvdm.models.modules.openaimodel3d.UNetModel', 'params': {'image_size': 32, 'in_channels': 4, 'out_channels': 4, 'model_channels': 320, 'attention_resolutions': [4, 2, 1], 'num_res_blocks': 2, 'channel_mult': [1, 2, 4, 4], 'num_heads': 8, 'transformer_depth': 1, 'context_dim': 768, 'use_checkpoint': True, 'legacy': False, 'kernel_size_t': 1, 'padding_t': 0, 'temporal_length': 16, 'use_relative_position': True}}, 'first_stage_config': {'target': 'lvdm.models.autoencoder.AutoencoderKL', 'params': {'embed_dim': 4, 'monitor': 'val/rec_loss', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}, 'lossconfig': {'target': 'torch.nn.Identity'}}}, 'cond_stage_config': {'target': 'lvdm.models.modules.condition_modules.FrozenCLIPEmbedder'}}}}
Loading model from E:\Documents\AI\stable-diffusion-webui\models/VideoCrafter/model.ckpt
LatentDiffusion: Running in eps-prediction mode
Successfully initialize the diffusion model !
DiffusionWrapper has 958.92 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████| 1.71G/1.71G [00:30<00:00, 56.5MB/s]
Sampling Batches (text-to-video): 100%|██████████████████████████████████████████████████| 1/1 [00:13<00:00, 13.40s/it]
text2video finished, saving frames to E:\Documents\AI\stable-diffusion-webui\outputs/img2img-images\text2video\20230410123039
Adding empty frames: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Making grids: 100%|█████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 15993.53it/s]
Adding soundtrack to *video*...
FFmpeg Audio stitching done in 2.27 seconds!
t2v complete, result saved at E:\Documents\AI\stable-diffusion-webui\outputs/img2img-images\text2video\20230410123039
Finish sampling!
Run time = 16.55 seconds
text2video — The model selected is:  ModelScope
 text2video extension for auto1111 webui
Git commit: 1e32b561 (Sat Apr  8 00:39:15 2023)
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "E:\Documents\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\text2vid.py", line 94, in process
    process_modelscope(skip_video_creation, ffmpeg_location, ffmpeg_crf, ffmpeg_preset, fps, add_soundtrack, soundtrack_path, \
  File "E:\Documents\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\text2vid.py", line 286, in process_modelscope
    samples, _ = pipe.infer(prompt, n_prompt, steps, frames, seed + batch if seed != -1 else -1, cfg_scale,
TypeError: TextToVideoSynthesis.infer() got an unexpected keyword argument 'mask'
Exception occurred: TextToVideoSynthesis.infer() got an unexpected keyword argument 'mask'

Additional information

kabachuha commented 1 year ago

What are the settings on your extension page: mode, steps, etc?

pmonck commented 1 year ago

I occasionally get this too. It goes away if I restart Automatic1111. I'll add more info if it happens again.
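One possible reason a restart helps (an assumption, not something confirmed in this thread) is that the running process still holds an older copy of a module in memory after the files on disk were updated, so a newer caller passes `mask` to an older `infer()`. The sketch below reproduces that situation with a throwaway module; the module name `hypothetical_pipeline` and both `infer` bodies are invented for illustration and are not the extension's code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two versions of a hypothetical module: the old one has no `mask` parameter.
OLD_SOURCE = "def infer(prompt):\n    return f'old:{prompt}'\n"
NEW_SOURCE = "def infer(prompt, mask=None):\n    return f'new:{prompt}:mask={mask}'\n"


def demo_stale_module():
    """Import a module, update it on disk, and show the in-memory copy is stale."""
    previous_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep cached bytecode out of the picture
    stale_error = None
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "hypothetical_pipeline.py"
        path.write_text(OLD_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("hypothetical_pipeline")

            # "Update" the file on disk while the old module stays loaded.
            path.write_text(NEW_SOURCE)
            try:
                module.infer("a cat", mask=None)
            except TypeError as exc:
                stale_error = str(exc)

            # Reloading (or restarting the process) picks up the new signature.
            importlib.reload(module)
            fresh_result = module.infer("a cat", mask=None)
            return stale_error, fresh_result
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("hypothetical_pipeline", None)
            sys.dont_write_bytecode = previous_flag


if __name__ == "__main__":
    error, result = demo_stale_module()
    print("before reload:", error)
    print("after reload: ", result)
```

If this is what happened, a full restart of the webui is the simplest way to make sure every module is loaded from the updated files.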

highjohnconquer commented 1 year ago

> What are the settings on your extension page: mode, steps, etc?

image

image

highjohnconquer commented 1 year ago

> I occasionally get this too. It goes away if I restart Automatic1111. I'll add more info if it happens again.

This seems to have done the trick.

Now I'm just getting strange video outputs.

justinwking commented 1 year ago

I am getting this error, or something similar. I have restarted Automatic1111 and still get it. It started today, after updating to the most recent version. Before updating, it was producing black images; then it started working, and now, after restarting again, it gives the following error message.

Traceback (most recent call last):
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\text2vid.py", line 96, in process
    process_modelscope(skip_video_creation, ffmpeg_location, ffmpeg_crf, ffmpeg_preset, fps, add_soundtrack, soundtrack_path, \
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\text2vid.py", line 297, in process_modelscope
    samples, _ = pipe.infer(prompt, n_prompt, steps, frames, seed + batch if seed != -1 else -1, cfg_scale,
  File "F:\AI\stable-diffusion-webui/extensions/sd-webui-text2video/scripts\modelscope\t2v_pipeline.py", line 245, in infer
    x0 = self.diffusion.ddim_sample_loop(
  File "F:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 1469, in ddim_sample_loop
    xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
  File "F:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 1321, in ddim_sample
    _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 1262, in p_mean_variance
    y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
  File "F:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 367, in forward
    x = self._forward_single(block, x, e, context, time_rel_pos_bias,
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 430, in _forward_single
    x = self._forward_single(block, x, e, context,
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 414, in _forward_single
    x = module(x, context)
  File "F:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 665, in forward
    x = block(x)
  File "F:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 732, in forward
    x = self.attn1(
  File "F:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\AI\stable-diffusion-webui\extensions\sd-webui-text2video\scripts\modelscope\t2v_model.py", line 496, in forward
    out = xformers.ops.memory_efficient_attention(
TypeError: memory_efficient_attention() got an unexpected keyword argument 'mask'
Exception occurred: memory_efficient_attention() got an unexpected keyword argument 'mask'
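This second traceback has the same shape as the original report, but one level deeper: a call site passes `mask=` to a function whose signature does not declare it. A minimal sketch of how such a mismatch could be detected before the call, using only the standard `inspect` module, is shown here. The helper names and the stand-in `attention_without_mask` function are hypothetical, and this is not necessarily what the linked fix commit does.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call `func`, dropping keywords it does not declare.

    Returns (result, dropped_names) so the caller can log what was ignored.
    """
    supported = {k: v for k, v in kwargs.items() if accepts_keyword(func, k)}
    dropped = sorted(set(kwargs) - set(supported))
    return func(*args, **supported), dropped


# Hypothetical stand-in for an attention function that has no `mask` parameter.
def attention_without_mask(query, key, value, attn_bias=None):
    return ("attended", attn_bias)


if __name__ == "__main__":
    print(accepts_keyword(attention_without_mask, "mask"))
    result, dropped = call_with_supported_kwargs(
        attention_without_mask, "q", "k", "v", mask="some-mask"
    )
    print(result, "dropped:", dropped)
```

Silently dropping a mask changes what the attention computes, so a check like this is more useful for diagnosis and logging than as a permanent workaround; the proper fix is to make the caller and the installed library agree on the parameter name.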

kabachuha commented 1 year ago

@justinwking thanks for your notice! Fixed it now in https://github.com/deforum-art/sd-webui-text2video/commit/67aaba9f0856589074384b5412c4553647f02d22