modelscope / DiffSynth-Studio

Enjoy the magic of Diffusion models!
Apache License 2.0

ERROR: Sizes of tensors must match except in dimension 1. Expected size 136 but got size 135 for tensor number 1 in the list -- which occurs at dancer.py #24

Open baomuhedao opened 7 months ago

baomuhedao commented 7 months ago

!!! Exception during processing !!!
Traceback (most recent call last):
  File "/data/comfy-ui/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "/data/comfy-ui/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "/data/comfy-ui/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/nodes.py", line 71, in stylize
    DiffSynthService().stylize(video_file_path, width, height, frames, fps, output_dir, TARGET_FPS, prompt, neg_prompt,stage1_infer_steps,stage2_infer_steps)
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth_service.py", line 181, in stylize
    runner.run(config_stage_1)
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth/pipelines/stable_diffusion_video.py", line 358, in run
    output_video = self.synthesize_video(model_manager, pipe, config["pipeline"]["seed"], smoother, config["pipeline"]["pipeline_inputs"])
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth/pipelines/stable_diffusion_video.py", line 304, in synthesize_video
    output_video = pipe(**pipeline_inputs, smoother=smoother)
  File "/data/comfy-ui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth/pipelines/stable_diffusion_video.py", line 226, in __call__
    noise_pred_posi = lets_dance_with_long_video(
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth/pipelines/stable_diffusion_video.py", line 43, in lets_dance_with_long_video
    hidden_states_batch = lets_dance(
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth/pipelines/dancer.py", line 72, in lets_dance
    hidden_states, time_emb, text_emb, res_stack = block(hidden_states, time_emb, text_emb, res_stack)
  File "/data/comfy-ui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/comfy-ui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/comfy-ui/custom_nodes/comfyui-cartoon-stylization/diffsynth/models/sd_unet.py", line 222, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 136 but got size 135 for tensor number 1 in the list.

Artiprocher commented 7 months ago

This error is usually caused by an incorrect input video size. Please check the number of frames and the resolution.

Erickrus commented 6 months ago

I guess this is because you didn't use a standard video resolution, e.g. 512x512 or 1024x1024. So far, I don't have a good understanding of the templates (config_stage_1_template, config_stage_2_template). Basically, I wrote an inference.py script as follows. It works. @baomuhedao

#@title inference.py
#@markdown
%%writefile /content/DiffSynth-Studio/inference.py

import subprocess
import json
from PIL import Image
import os

def extract_video_info(video_filename):
    # Command to get video information using ffprobe
    command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', '-show_entries', 'stream=width,height,nb_frames,r_frame_rate,codec_type', '-of', 'json', video_filename]

    # Run the command
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = process.communicate()

    if error:
        # Handle error
        print("Error:", error.decode())
        return None

    # Parse JSON output
    info = json.loads(output.decode())

    # Extract video information
    video_info = {}
    if 'streams' in info:
        for stream in info['streams']:
            if stream['codec_type'] == 'video':
                if 'width' in stream and 'height' in stream:
                    video_info['width'] = int(stream['width'])
                    video_info['height'] = int(stream['height'])
                if 'nb_frames' in stream:
                    video_info['num_frames'] = int(stream['nb_frames'])
                if 'r_frame_rate' in stream:
                    frame_rate = stream['r_frame_rate'].split('/')
                    if len(frame_rate) == 2 and float(frame_rate[1]) != 0:
                        video_info['fps'] = int(round(float(frame_rate[0]) / float(frame_rate[1])))
                    else:
                        video_info['fps'] = float(frame_rate[0])

    return video_info

def get_stage_templates(input_video_filename, output_video_dirname):

    video_info = extract_video_info(input_video_filename)
    if video_info:
        print(json.dumps(video_info, indent=2, ensure_ascii=False))
        config_stage_1_template = {
            "models": {
                "model_list": [
                    "models/stable_diffusion/aingdiffusion_v12.safetensors",
                    "models/ControlNet/control_v11p_sd15_softedge.pth",
                    "models/ControlNet/control_v11f1p_sd15_depth.pth"
                ],
                "textual_inversion_folder": "models/textual_inversion",
                "device": "cuda",
                "lora_alphas": [],
                "controlnet_units": [
                    {
                        "processor_id": "softedge",
                        "model_path": "models/ControlNet/control_v11p_sd15_softedge.pth",
                        "scale": 0.5
                    },
                    {
                        "processor_id": "depth",
                        "model_path": "models/ControlNet/control_v11f1p_sd15_depth.pth",
                        "scale": 0.5
                    }
                ]
            },
            "data": {
                "input_frames": {
                    "video_file": input_video_filename,
                    "image_folder": None,
                    "height": video_info["height"] // 2,
                    "width": video_info["width"] // 2,
                    "start_frame_id": 0,
                    "end_frame_id": video_info["num_frames"]# - 1
                },
                "controlnet_frames": [
                    {
                        "video_file": input_video_filename,
                        "image_folder": None,
                        "height": video_info["height"] // 2,
                        "width": video_info["width"] // 2,
                        "start_frame_id": 0,
                        "end_frame_id": video_info["num_frames"]# - 1
                    },
                    {
                        "video_file": input_video_filename,
                        "image_folder": None,
                        "height": video_info["height"] // 2,
                        "width": video_info["width"] // 2,
                        "start_frame_id": 0,
                        "end_frame_id": video_info["num_frames"]# - 1
                    }
                ],
                "output_folder": "data/examples/diffutoon_edit/color_video",
                "fps": video_info["fps"]
            },
            "smoother_configs": [
                {
                    "processor_type": "FastBlend",
                    "config": {}
                }
            ],
            "pipeline": {
                "seed": 0,
                "pipeline_inputs": {
                    "prompt": "best quality, perfect anime illustration, orange clothes, night, a girl is dancing, smile, solo, black silk stockings",
                    "negative_prompt": "verybadimagenegative_v1.3",
                    "cfg_scale": 7.0,
                    "clip_skip": 1,
                    "denoising_strength": 0.9,
                    "num_inference_steps": 20,
                    "animatediff_batch_size": 8,
                    "animatediff_stride": 4,
                    "unet_batch_size": 8,
                    "controlnet_batch_size": 8,
                    "cross_frame_attention": True,
                    "smoother_progress_ids": [-1],
                    # The following parameters will be overwritten. You don't need to modify them.
                    "input_frames": [],
                    "num_frames": video_info["num_frames"],
                    "width": video_info["width"] // 2,
                    "height": video_info["height"] // 2,
                    "controlnet_frames": []
                }
            }
        }

        config_stage_2_template = {
            "models": {
                "model_list": [
                    "models/stable_diffusion/aingdiffusion_v12.safetensors",
                    "models/AnimateDiff/mm_sd_v15_v2.ckpt",
                    "models/ControlNet/control_v11f1e_sd15_tile.pth",
                    "models/ControlNet/control_v11p_sd15_lineart.pth"
                ],
                "textual_inversion_folder": "models/textual_inversion",
                "device": "cuda",
                "lora_alphas": [],
                "controlnet_units": [
                    {
                        "processor_id": "tile",
                        "model_path": "models/ControlNet/control_v11f1e_sd15_tile.pth",
                        "scale": 0.5
                    },
                    {
                        "processor_id": "lineart",
                        "model_path": "models/ControlNet/control_v11p_sd15_lineart.pth",
                        "scale": 0.5
                    }
                ]
            },
            "data": {
                "input_frames": {
                    "video_file": input_video_filename,
                    "image_folder": None,
                    "height": video_info["height"],
                    "width": video_info["width"],
                    "start_frame_id": 0,
                    "end_frame_id": video_info["num_frames"]# - 1
                },
                "controlnet_frames": [
                    {
                        "video_file": input_video_filename,
                        "image_folder": None,
                        "height": video_info["height"],
                        "width": video_info["width"],
                        "start_frame_id": 0,
                        "end_frame_id": video_info["num_frames"]# - 1
                    },
                    {
                        "video_file": input_video_filename,
                        "image_folder": None,
                        "height": video_info["height"],
                        "width": video_info["width"],
                        "start_frame_id": 0,
                        "end_frame_id": video_info["num_frames"]# - 1
                    }
                ],
                "output_folder": output_video_dirname,
                "fps": video_info
            },
            "pipeline": {
                "seed": 0,
                "pipeline_inputs": {
                    "prompt": "best quality, perfect anime illustration, light, a girl is dancing, smile, solo",
                    "negative_prompt": "verybadimagenegative_v1.3",
                    "cfg_scale": 7.0,
                    "clip_skip": 2,
                    "denoising_strength": 1.0,
                    "num_inference_steps": 10,
                    "animatediff_batch_size": 16,
                    "animatediff_stride": 8,
                    "unet_batch_size": 1,
                    "controlnet_batch_size": 1,
                    "cross_frame_attention": False,
                    # The following parameters will be overwritten. You don't need to modify them.
                    "input_frames": [],
                    "num_frames": video_info["num_frames"],
                    "width": video_info["width"],
                    "height": video_info["height"],
                    "controlnet_frames": []
                }
            }
        }
        return video_info, config_stage_1_template, config_stage_2_template

def preprocess(input_video_filename, fps=5):
    print(f"preprocess(input_video_filename= {input_video_filename}, fps= {fps})")

    standard_video_filename = "/content/standard_video.mp4"
    standard_audio_filename = "/content/standard_video.wav"
    bg_filename = "/content/bg.png"

    print(f"create bg file: {bg_filename}")
    im = Image.new(mode='RGB', size=(1024, 1024), color=(0, 0, 0))
    im.save(bg_filename)

    print(f"convert video as: {standard_video_filename}")
    os.system('ffmpeg -hide_banner -loglevel error -y -i '+input_video_filename+' -i '+ bg_filename + ''' -filter_complex "[1:v]scale=1024:1024[bg]; [0:v]scale='if(gt(a,1024/1024),1024,-1)':'if(gt(a,1024/1024),-1,1024)', pad=1024:1024:(ow-iw)/2:(oh-ih)/2[video]; [bg][video]overlay=0:0" -r '''+str(fps)+' '+standard_video_filename)

    print(f"extract audio as: {standard_audio_filename}")
    os.system('ffmpeg -hide_banner -loglevel error -y -i '+input_video_filename+' '+standard_audio_filename)

    return standard_video_filename, standard_audio_filename

if __name__ == "__main__":
    import sys

    from diffsynth import SDVideoPipelineRunner

    input_video_filename = sys.argv[1]
    output_video_dirname = sys.argv[2]

    # It takes very long to generate a 15-second video at 25 fps
    standard_video_filename, standard_audio_filename = preprocess(input_video_filename, fps=25)

    video_info, config_stage_1_template, config_stage_2_template = get_stage_templates(
        standard_video_filename, 
        output_video_dirname
    )

    config = config_stage_2_template.copy()
    config["data"]["controlnet_frames"] = [config["data"]["input_frames"], config["data"]["input_frames"]]
    config["data"]["output_folder"] = output_video_dirname
    config["data"]["fps"] = video_info["fps"]

    runner = SDVideoPipelineRunner()
    runner.run(config)

then run this cell:

%cd /content/DiffSynth-Studio
!python3 inference.py /content/input_video.mp4 /content/toon_output
# the output video will be stored as /content/toon_output/video.mp4
d8ahazard commented 4 months ago

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 136 but got size 135 for tensor number 1 in the list.

I spent a good chunk of time this weekend trying to track down the source of this issue, and I think I finally figured it out.

Basically, the SD UNet uses a bunch of downsampler and upsampler blocks which scale the feature map down and back up for various processing steps. This basically just divides the spatial size by 2 a bunch of times, then multiplies it back up again.

The underlying issue with the mismatched shape (at least for me) came from using an input that was 1920x1080, which I'm guessing you are using as well.

Under the hood, both dims are divided by 8 - giving us a tensor of 240 x 135. Note the 135.

Now, if you divide 135 by 2, you get 67.5, which has a decimal point, so the size has to be rounded when downscaling, and it gets rounded up to 68. Then, when that gets upscaled again, we end up with a tensor of size 136 instead of 135.

Fun, isn't it?

This happens in the final upsample block in the unet (the one with 640 channels).

My fix was just to use the nearest "SDXL" resolution, which I suspect are all dimensions that can be subdivided repeatedly without this issue.
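
To make the rounding concrete, here is a minimal standalone sketch (not DiffSynth code) of the shape arithmetic described above, where a single stride-2 convolution and a 2x nearest-neighbour upsample stand in for the UNet's down/up blocks: a 1920x1080 frame becomes a 240x135 latent, the downsample rounds 135 up to 68, and the upsample brings it back as 136, so the concatenation with the 135-row skip tensor no longer matches.

# Minimal sketch (illustration only, not the DiffSynth UNet) of the 135 -> 68 -> 136 rounding
import torch
import torch.nn as nn

latent_h = 1080 // 8                  # 135 rows in latent space
latent_w = 1920 // 8                  # 240 columns
x = torch.randn(1, 4, latent_h, latent_w)

down = nn.Conv2d(4, 4, kernel_size=3, stride=2, padding=1)   # stride-2 downsampler
up = nn.Upsample(scale_factor=2, mode="nearest")             # 2x upsampler

h = down(x)        # 135 -> ceil(135 / 2) = 68
print(h.shape)     # torch.Size([1, 4, 68, 120])
h = up(h)          # 68 -> 136, no longer matches the 135-row skip tensor
print(h.shape)     # torch.Size([1, 4, 136, 240])
# torch.cat([h, x], dim=1) now fails with the same kind of
# "Expected size 136 but got size 135" error seen in sd_unet.py.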

nitinmukesh commented 2 months ago

@d8ahazard @Artiprocher

As I understand it, both width and height should be divisible by 16, or maybe multiples of 64.

I tried 1920 x 1088, and both are divisible by 16 (1920/16 = 120, 1088/16 = 68), but I'm still getting this error.

So for me only 1024 x 576 works (1024/16 = 64, 576/16 = 36).

Is there any higher resolution (other than 1024 x 576) that I can try?
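
For reference, a rough sanity-check sketch (my own, not from the repository), based on d8ahazard's analysis above: divide a pixel dimension by 8 to get the latent size, simulate three rounds of stride-2 halving, then double back up and check whether the skip-connection sizes still line up. By this arithmetic 1920 x 1088 would pass, so the failure reported with that size may have a different cause; treat it only as a heuristic.

# Heuristic check (an assumption, not DiffSynth's own logic): simulate the VAE's /8
# and the UNet's repeated stride-2 halving, then double back up and see whether the
# skip-connection sizes still match.
import math

def halves_cleanly(pixels, vae_factor=8, num_downsamples=3):
    size = pixels // vae_factor             # e.g. 1080 -> 135 latent rows
    skip_sizes = [size]
    for _ in range(num_downsamples):
        size = math.ceil(size / 2)          # stride-2 conv with padding rounds up
        skip_sizes.append(size)
    for skip in reversed(skip_sizes[:-1]):  # walk back up through the upsamplers
        size *= 2
        if size != skip:                    # upsampled size no longer matches the skip tensor
            return False
    return True

for w, h in [(1920, 1080), (1920, 1088), (1024, 576), (1024, 1024)]:
    print(f"{w}x{h}: {'ok' if halves_cleanly(w) and halves_cleanly(h) else 'mismatch'}")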