kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: RuntimeError: Input type (double) and bias type (struct c10::Half) should be the same #120

Closed. maiagates closed this issue 1 year ago.

maiagates commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

Encountered an error while using img2vid; other parts of the WebUI and the text2video extension (txt2vid, vid2vid) work fine.

Steps to reproduce the problem

  1. Double-click the webui.bat file
  2. Wait until it loads
  3. Enter the local URL in the browser (Chrome)
  4. Go to the text2video tab
  5. Load a square image in the img2vid tab
  6. Enter a simple prompt ("anime girl with red hair") and leave the negative prompt at its default
  7. Set an equal number of frames and inpainting frames (24)
  8. Copy and paste the example parameters into the Inpainting weights field
  9. Click Generate
  10. The console shows the error below

What should have happened?

A video should have been generated from the image uploaded in the extension's img2vid tab.

WebUI and Deforum extension Commit IDs

webui commit id - 22bcc7be428c94e9408f589966c2040187245d81
txt2vid commit id - 9b79cb8d3ab44de883c5ffafd89dd708f251458a

What GPU were you using for launching?

GTX 3060, 12 GB VRAM

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

(screenshot of the error attached)

Console logs

venv "D:\stable-diffusion-webui\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Commit hash: 22bcc7be428c94e9408f589966c2040187245d81
Installing requirements for Web UI
Installing requirements for Batch Face Swap

current transparent-background 1.2.3

Launching Web UI with arguments: --xformers
Loading weights [e7bf829cff] from D:\stable-diffusion-webui\models\Stable-diffusion\nightSkyYOZORAStyle_yozoraV1Origin.safetensors
Creating model from config: D:\stable-diffusion-webui\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading VAE weights specified in settings: D:\stable-diffusion-webui\models\VAE\orangemix.vae.pt
Applying xformers cross attention optimization.
Textual inversion embeddings loaded(2): easynegative, m4r1asd
Model loaded in 52.7s (load weights from disk: 0.5s, create model: 0.6s, apply weights to model: 17.4s, apply half(): 2.6s, load VAE: 8.4s, move model to device: 1.4s, hijack: 0.2s, load textual inversion embeddings: 21.6s).
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 64.8s (import torch: 2.0s, import gradio: 1.6s, import ldm: 0.6s, other imports: 1.2s, load scripts: 2.0s, load SD checkpoint: 53.1s, create ui: 3.7s, gradio launch: 0.5s).
text2video — The model selected is:  ModelScope
 text2video extension for auto1111 webui
Git commit: 9b79cb8d (Sun Apr 16 12:17:20 2023)
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Received an image for inpainting C:\Users\diego\AppData\Local\Temp\c418ffc32a1055d8c02825456f834e80b8fe560b\00001-165371138.png
Converted the frames to tensor (1, 24, 3, 256, 256)
Computing latents
STARTING VAE ON GPU
VAE HALVED
latents torch.Size([1, 4, 24, 32, 32]) tensor(-0.0112, device='cuda:0', dtype=torch.float64) tensor(0.8821, device='cuda:0', dtype=torch.float64)
DDIM sampling:   0%|                                                                            | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last):                                                              | 0/31 [00:00<?, ?it/s]
  File "D:\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\t2v_helpers\render.py", line 24, in run
    vids_pack = process_modelscope(args_dict)
  File "D:\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\process_modelscope.py", line 193, in process_modelscope
    samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
  File "D:\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\t2v_pipeline.py", line 245, in infer
    x0 = self.diffusion.ddim_sample_loop(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1470, in ddim_sample_loop
    xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1322, in ddim_sample
    _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
  File "D:\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1263, in p_mean_variance
    y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 367, in forward
    x = self._forward_single(block, x, e, context, time_rel_pos_bias,
  File "D:\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 430, in _forward_single
    x = self._forward_single(block, x, e, context,
  File "D:\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 434, in _forward_single
    x = module(x)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion-webui\extensions-builtin\Lora\lora.py", line 319, in lora_Conv2d_forward
    return torch.nn.Conv2d_forward_before_lora(self, input)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (double) and bias type (struct c10::Half) should be the same
Exception occurred: Input type (double) and bias type (struct c10::Half) should be the same
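
The traceback ends in a Conv2d of the halved UNet receiving the float64 latents logged above ("latents ... dtype=torch.float64"). For reference, the same class of dtype-mismatch error can be reproduced outside the extension with a minimal standalone sketch (not the extension's code; the layer shapes below are arbitrary):

```python
import torch

# Arbitrary half-precision conv standing in for the halved UNet layers (sketch only).
conv = torch.nn.Conv2d(4, 320, kernel_size=3, padding=1).cuda().half()

# The img2vid latents are logged above with dtype=torch.float64 ("double").
x = torch.randn(1, 4, 32, 32, device="cuda", dtype=torch.float64)

# Raises a dtype-mismatch RuntimeError like the one in the traceback above.
conv(x)
```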

Additional information

webui.bat has --xformers added. "VAE HALVED" is printed during the img2vid process.
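
If it helps triage: a possible workaround (my assumption, not confirmed by the extension) would be to cast the img2vid latents to the UNet's dtype before ddim_sample_loop runs, i.e. right after the "Computing latents" step. The tensor and dtype names below are illustrative only:

```python
import torch

# Hypothetical stand-in for the img2vid latents; the log shows they come out
# as float64 on the GPU ("latents ... dtype=torch.float64").
latents = torch.randn(1, 4, 24, 32, 32, device="cuda", dtype=torch.float64)

# Cast to the dtype of the halved UNet before DDIM sampling starts.
latents = latents.to(dtype=torch.float16)
```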