modelscope / DiffSynth-Studio

Enjoy the magic of Diffusion models!
Apache License 2.0

Toon Shading (Diffutoon): Error: Sizes of tensors must match except in dimension 1 #167

Closed: nitinmukesh closed this issue 1 month ago

nitinmukesh commented 1 month ago

https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/Diffutoon/README.md

I am trying to run the example above, but I get the following error.

(venv) C:\tut\DiffSynth-Studio>python examples\Diffutoon\diffutoon_toon_shading.py
Failed to load cpm_kernels:No module named 'cpm_kernels'
Downloading models: ['AingDiffusion_v12', 'AnimateDiff_v2', 'ControlNet_v11p_sd15_lineart', 'ControlNet_v11f1e_sd15_tile', 'TextualInversion_VeryBadImageNegative_v1.3']
    aingdiffusion_v12.safetensors has been already in models/stable_diffusion.
    mm_sd_v15_v2.ckpt has been already in models/AnimateDiff.
    control_v11p_sd15_lineart.pth has been already in models/ControlNet.
    sk_model.pth has been already in models/Annotators.
    sk_model2.pth has been already in models/Annotators.
    control_v11f1e_sd15_tile.pth has been already in models/ControlNet.
    verybadimagenegative_v1.3.pt has been already in models/textual_inversion.
Loading models from: models/stable_diffusion/aingdiffusion_v12.safetensors
    model_name: sd_text_encoder model_class: SDTextEncoder
    model_name: sd_unet model_class: SDUNet
    model_name: sd_vae_decoder model_class: SDVAEDecoder
    model_name: sd_vae_encoder model_class: SDVAEEncoder
    The following models are loaded: ['sd_text_encoder', 'sd_unet', 'sd_vae_decoder', 'sd_vae_encoder'].
Loading models from: models/AnimateDiff/mm_sd_v15_v2.ckpt
    model_name: sd_motion_modules model_class: SDMotionModel
    The following models are loaded: ['sd_motion_modules'].
Loading models from: models/ControlNet/control_v11f1e_sd15_tile.pth
    model_name: sd_controlnet model_class: SDControlNet
    The following models are loaded: ['sd_controlnet'].
Loading models from: models/ControlNet/control_v11p_sd15_lineart.pth
    model_name: sd_controlnet model_class: SDControlNet
    The following models are loaded: ['sd_controlnet'].
C:\tut\DiffSynth-Studio\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Using sd_text_encoder from models/stable_diffusion/aingdiffusion_v12.safetensors.
Using sd_unet from models/stable_diffusion/aingdiffusion_v12.safetensors.
Using sd_vae_decoder from models/stable_diffusion/aingdiffusion_v12.safetensors.
Using sd_vae_encoder from models/stable_diffusion/aingdiffusion_v12.safetensors.
Using sd_controlnet from models/ControlNet/control_v11f1e_sd15_tile.pth.
Using sd_controlnet from models/ControlNet/control_v11p_sd15_lineart.pth.
No sd_ipadapter models available.
No sd_ipadapter_clip_image_encoder models available.
Using sd_motion_modules from models/AnimateDiff/mm_sd_v15_v2.ckpt.
c:\tut\diffsynth-studio\diffsynth\models\attention.py:54: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
Textual inversion verybadimagenegative_v1.3 is enabled.
100%|████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 147.31it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 30/30 [00:02<00:00, 12.61it/s]
  0%|                                                                                          | 0/10 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "C:\tut\DiffSynth-Studio\examples\Diffutoon\diffutoon_toon_shading.py", line 100, in <module>
    runner.run(config)
  File "c:\tut\diffsynth-studio\diffsynth\pipelines\pipeline_runner.py", line 98, in run
    output_video = self.synthesize_video(model_manager, pipe, config["pipeline"]["seed"], smoother, **config["pipeline"]["pipeline_inputs"])
  File "c:\tut\diffsynth-studio\diffsynth\pipelines\pipeline_runner.py", line 48, in synthesize_video
    output_video = pipe(**pipeline_inputs, smoother=smoother)
  File "C:\tut\DiffSynth-Studio\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "c:\tut\diffsynth-studio\diffsynth\pipelines\sd_video.py", line 232, in __call__
    noise_pred_posi = lets_dance_with_long_video(
  File "c:\tut\diffsynth-studio\diffsynth\pipelines\sd_video.py", line 40, in lets_dance_with_long_video
    hidden_states_batch = lets_dance(
  File "c:\tut\diffsynth-studio\diffsynth\pipelines\dancer.py", line 76, in lets_dance
    hidden_states, time_emb, text_emb, res_stack = block(hidden_states, time_emb, text_emb, res_stack)
  File "C:\tut\DiffSynth-Studio\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\tut\DiffSynth-Studio\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\tut\diffsynth-studio\diffsynth\models\sd_unet.py", line 226, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.

diffutoon_toon_shading.py

from diffsynth import SDVideoPipelineRunner, download_models


# Download models (automatically)
# `models/stable_diffusion/aingdiffusion_v12.safetensors`: [link](https://civitai.com/api/download/models/229575)
# `models/AnimateDiff/mm_sd_v15_v2.ckpt`: [link](https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15_v2.ckpt)
# `models/ControlNet/control_v11p_sd15_lineart.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_lineart.pth)
# `models/ControlNet/control_v11f1e_sd15_tile.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.pth)
# `models/Annotators/sk_model.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model.pth)
# `models/Annotators/sk_model2.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model2.pth)
# `models/textual_inversion/verybadimagenegative_v1.3.pt`: [link](https://civitai.com/api/download/models/25820?type=Model&format=PickleTensor&size=full&fp=fp16)
download_models([
    "AingDiffusion_v12",
    "AnimateDiff_v2",
    "ControlNet_v11p_sd15_lineart",
    "ControlNet_v11f1e_sd15_tile",
    "TextualInversion_VeryBadImageNegative_v1.3"
])

# The original video in the example is https://www.bilibili.com/video/BV1iG411a7sQ/.
config = {
    "models": {
        "model_list": [
            "models/stable_diffusion/aingdiffusion_v12.safetensors",
            "models/AnimateDiff/mm_sd_v15_v2.ckpt",
            "models/ControlNet/control_v11f1e_sd15_tile.pth",
            "models/ControlNet/control_v11p_sd15_lineart.pth"
        ],
        "textual_inversion_folder": "models/textual_inversion",
        "device": "cuda",
        "lora_alphas": [],
        "controlnet_units": [
            {
                "processor_id": "tile",
                "model_path": "models/ControlNet/control_v11f1e_sd15_tile.pth",
                "scale": 0.5
            },
            {
                "processor_id": "lineart",
                "model_path": "models/ControlNet/control_v11p_sd15_lineart.pth",
                "scale": 0.5
            }
        ]
    },
    "data": {
        "input_frames": {
            "video_file": "data/examples/diffutoon/input_video.mp4",
            "image_folder": None,
            "height": 360,
            "width": 640,
            "start_frame_id": 0,
            "end_frame_id": 30
        },
        "controlnet_frames": [
            {
                "video_file": "data/examples/diffutoon/input_video.mp4",
                "image_folder": None,
                "height": 360,
                "width": 640,
                "start_frame_id": 0,
                "end_frame_id": 30
            },
            {
                "video_file": "data/examples/diffutoon/input_video.mp4",
                "image_folder": None,
                "height": 360,
                "width": 640,
                "start_frame_id": 0,
                "end_frame_id": 30
            }
        ],
        "output_folder": "output",
        "fps": 30
    },
    "pipeline": {
        "seed": 0,
        "pipeline_inputs": {
            "prompt": "best quality, perfect anime illustration, light, a girl is dancing, smile, solo",
            "negative_prompt": "verybadimagenegative_v1.3",
            "cfg_scale": 7.0,
            "clip_skip": 2,
            "denoising_strength": 1.0,
            "num_inference_steps": 10,
            "animatediff_batch_size": 16,
            "animatediff_stride": 8,
            "unet_batch_size": 1,
            "controlnet_batch_size": 1,
            "cross_frame_attention": False,
            # The following parameters will be overwritten. You don't need to modify them.
            "input_frames": [],
            "num_frames": 30,
            "width": 640,
            "height": 360,
            "controlnet_frames": []
        }
    }
}

runner = SDVideoPipelineRunner()
runner.run(config)

nitinmukesh commented 1 month ago

The following worked:

"width": 1024, "height": 576,
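
A likely explanation for why 640x360 fails and 1024x576 works, sketched under assumptions: the code below assumes the usual Stable Diffusion 1.5 layout (a VAE that shrinks each side by 8, then three stride-2 downsamplers in the UNet that round odd sizes up, mirrored by three 2x upsamplers). This has not been checked against the DiffSynth-Studio source, and the function names are hypothetical. Under those assumptions, a height of 360 gives a latent height of 45, which downsamples to 23, 12, 6 and upsamples back to 12, 24, so a 24-row tensor meets a 23-row skip connection, matching the `Expected size 24 but got size 23` message in the traceback.

```python
import math


def trace_unet_sizes(pixels, vae_factor=8, num_downsamples=3):
    """Trace one spatial axis through an assumed SD 1.5-style UNet.

    Returns (down, up): sizes stored for skip connections on the way down,
    and sizes produced by repeated 2x upsampling on the way back up.
    """
    latent = pixels // vae_factor
    down = [latent]
    for _ in range(num_downsamples):
        # A 3x3, stride-2, padding-1 convolution maps n to ceil(n / 2).
        down.append(math.ceil(down[-1] / 2))
    up = [down[-1]]
    for _ in range(num_downsamples):
        up.append(up[-1] * 2)
    return down, up


def first_mismatch(pixels):
    """Return (upsampled_size, skip_size) at the first disagreement, or None."""
    down, up = trace_unet_sizes(pixels)
    for skip, upsampled in zip(reversed(down), up):
        if skip != upsampled:
            return upsampled, skip
    return None


def round_up_to_multiple(pixels, multiple=64):
    """Smallest size >= pixels that is divisible by `multiple`."""
    return math.ceil(pixels / multiple) * multiple


if __name__ == "__main__":
    for size in (640, 360, 1024, 576):
        print(size, first_mismatch(size))
    # 360 is the only one of these that reports a mismatch: (24, 23)
    print(round_up_to_multiple(360))  # 384
```

If these assumptions hold, any width and height divisible by 64 should avoid the mismatch, which is consistent with 1024x576 working. The trace also suggests only the height of 360 is a problem here, since 640 is already a multiple of 64.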