modelscope / DiffSynth-Studio

Enjoy the magic of Diffusion models!

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 16 but got size 15 for tensor number 1 in the list. #9

Open maxtlw opened 9 months ago

maxtlw commented 9 months ago

Hi! :) I'm really interested in the new Diffutoon pipeline, but whatever input video I use, I get this error:

File "/home/wizard/repositories/DiffSynth-Studio/diffsynth/models/sd_unet.py", line 222, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 16 but got size 15 for tensor number 1 in the list.

I set up the environment as indicated in the README.md and it worked flawlessly. I have no idea what I should look for to fix this: I haven't changed anything in the settings except the input video path and its resolution.

Thank you for the help!!

Artiprocher commented 9 months ago

If you have changed the input video, the controlnet_frames should also be changed to match it. Please check that.
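
For reference, here is a minimal sketch of the relevant config section, assuming the layout of the Diffutoon examples (paths and frame ranges are placeholders). Each ControlNet unit loads its own copy of the frames, so every entry in controlnet_frames must point at the same video, resolution, and frame range as input_frames:

    # Hypothetical excerpt, assuming the Diffutoon example config layout;
    # each dict holds load_video arguments, and paths are placeholders.
    data = {
        "input_frames": {
            "video_file": "data/my_input_video.mp4",   # the new video
            "image_folder": None,
            "height": 1024, "width": 1536,             # multiples of 64
            "start_frame_id": 0, "end_frame_id": 30,
        },
        # One entry per ControlNet unit (e.g. softedge and depth); each
        # must match input_frames in video, resolution, and frame range.
        "controlnet_frames": [
            {
                "video_file": "data/my_input_video.mp4",
                "image_folder": None,
                "height": 1024, "width": 1536,
                "start_frame_id": 0, "end_frame_id": 30,
            },
            {
                "video_file": "data/my_input_video.mp4",
                "image_folder": None,
                "height": 1024, "width": 1536,
                "start_frame_id": 0, "end_frame_id": 30,
            },
        ],
    }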

AInseven commented 9 months ago

controlnet_frames and input_frames have the same length, but I still get the error. Any ideas what I can do? Thank you.

I checked stable_diffusion_video.py at line 315:

    def add_data_to_pipeline_inputs(self, data, pipeline_inputs):
        pipeline_inputs["input_frames"] = self.load_video(**data["input_frames"])
        pipeline_inputs["num_frames"] = len(pipeline_inputs["input_frames"])
        pipeline_inputs["width"], pipeline_inputs["height"] = pipeline_inputs["input_frames"][0].size
        if len(data["controlnet_frames"]) > 0:
            pipeline_inputs["controlnet_frames"] = [self.load_video(**unit) for unit in data["controlnet_frames"]]
        return pipeline_inputs

the returned pipeline_inputs is:

    'input_frames' = {list: 30} [<PIL.Image.Image image mode=RGB size=720x1080 at 0x275E498D490>, ...
    'num_frames' = {int} 30
    'width' = {int} 720
    'height' = {int} 1080
    'controlnet_frames' = {list: 2}
        0 = {list: 30} [<PIL.Image.Image image mode=RGB size=720x1080 at 0x275E498DCA0>, ...
        1 = {list: 30} [<PIL.Image.Image image mode=RGB size=720x1080 at 0x275E49C3850>, ...
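
A quick consistency check along these lines (a hypothetical helper, not part of DiffSynth-Studio) confirms that the frame counts and sizes agree, so the mismatch is not in the frame data itself:

    # Hypothetical sanity check, not part of DiffSynth-Studio: verify that
    # every ControlNet frame list matches input_frames in count and size.
    def check_pipeline_inputs(pipeline_inputs):
        input_frames = pipeline_inputs["input_frames"]
        for i, unit in enumerate(pipeline_inputs["controlnet_frames"]):
            assert len(unit) == len(input_frames), f"unit {i}: frame count mismatch"
            assert all(f.size == input_frames[0].size for f in unit), f"unit {i}: size mismatch"
        # Everything can agree here and the UNet can still fail; the
        # 720x1080 resolution turns out to be the real problem (see below).
        width, height = input_frames[0].size
        if width % 64 or height % 64:
            print(f"warning: {width}x{height} is not a multiple of 64")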


Full log from running \DiffSynth-Studio\examples\diffutoon_toon_shading_with_editing_signals.py:
Loading videos ...
Loading videos ... done!
Loading models ...
model_list: ['C:\\sd-webui-aki-v4.4\\models\\Stable-diffusion\\anime\\aingdiffusion_v12.safetensors', 'C:\\sd-webui-aki-v4.4\\models\\ControlNet\\control_v11p_sd15_softedge.pth', 'C:\\sd-webui-aki-v4.4\\models\\ControlNet\\control_v11f1p_sd15_depth.pth']
C:\software\Anaconda\envs\DiffSynthStudio\lib\site-packages\timm\models\_factory.py:117: UserWarning: Mapping deprecated model name vit_base_resnet50_384 to current vit_base_r50_s16_384.orig_in21k_ft_in1k.
  model = create_fn(
Loading models ... done!
Loading smoother ...
Loading smoother ... done!
Synthesizing videos ...
C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\models\attention.py:43: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  hidden_states = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
100%|██████████| 30/30 [00:02<00:00, 14.11it/s]
100%|██████████| 30/30 [00:04<00:00,  6.86it/s]
  0%|          | 0/20 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\examples\diffutoon_toon_shading_with_editing_signals.py", line 185, in <module>
    runner.run(config_stage_1)
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\pipelines\stable_diffusion_video.py", line 357, in run
    output_video = self.synthesize_video(model_manager, pipe, config["pipeline"]["seed"], smoother, **config["pipeline"]["pipeline_inputs"])
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\pipelines\stable_diffusion_video.py", line 300, in synthesize_video
    output_video = pipe(**pipeline_inputs, smoother=smoother)
  File "C:\software\Anaconda\envs\DiffSynthStudio\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\pipelines\stable_diffusion_video.py", line 221, in __call__
    noise_pred_posi = lets_dance_with_long_video(
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\pipelines\stable_diffusion_video.py", line 38, in lets_dance_with_long_video
    hidden_states_batch = lets_dance(
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\pipelines\dancer.py", line 72, in lets_dance
    hidden_states, time_emb, text_emb, res_stack = block(hidden_states, time_emb, text_emb, res_stack)
  File "C:\software\Anaconda\envs\DiffSynthStudio\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\software\Anaconda\envs\DiffSynthStudio\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ainse\PycharmProjects\DiffSynth-Studio\diffsynth\models\sd_unet.py", line 222, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.

Artiprocher commented 9 months ago

Oh! That resolution is not supported. The height and width should each be a multiple of 64, for example 1536*1024.
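
To see why (an editorial sketch, not DiffSynth-Studio code): the SD UNet halves the spatial size at each down block and concatenates each upsampled tensor with the residual stored on the way down. When a level ends up with an odd size, downsampling floors it and upsampling doubles it back to an even number, so the torch.cat in sd_unet.py fails exactly as in the tracebacks above. Snapping the requested resolution to a multiple of 64 avoids this:

    import torch
    import torch.nn.functional as F

    # Minimal sketch of the failure mode (assumed shapes, not DiffSynth code):
    # an odd-sized feature map is floored by downsampling, so the upsampled
    # tensor no longer matches the residual stored for the skip connection.
    res = torch.randn(1, 4, 15, 15)            # odd spatial size on the down path
    down = F.avg_pool2d(res, 2)                # 15 -> 7 (floored)
    up = F.interpolate(down, scale_factor=2)   # 7 -> 14, no longer 15
    try:
        torch.cat([up, res], dim=1)            # same cat as sd_unet.py line 222
    except RuntimeError as e:
        print(e)  # Sizes of tensors must match except in dimension 1 ...

    # Fix: round the requested resolution to multiples of 64 before loading.
    def round_to_64(x: int) -> int:
        return max(64, round(x / 64) * 64)

    width, height = 720, 1080                  # the resolution used above
    print(round_to_64(width), round_to_64(height))  # -> 704 1088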