kijai / ComfyUI-CogVideoXWrapper


I2V Error when changing input image after running at least once: "Sizes of tensors must match except in dimension 1. Expected size 65 but got size 60 for tensor number 1 in the list". #70

Closed · genericgod closed this issue 2 months ago

genericgod commented 2 months ago

I get this error when I swap the input image for a different one after one or more generations. I used an image resize node, so the input image should be exactly the same size. Model: CogVideoX-5b-I2V. Here's the detailed ComfyUI error report:

# ComfyUI Error Report

## Error Details

### System Information
- **ComfyUI Version:** v0.2.2-50-g0bfc7cc
- **Arguments:** ComfyUI\main.py --listen --preview-method auto
- **OS:** nt
- **Python Version:** 3.11.8 (tags/v3.11.8:db85d51, Feb  6 2024, 22:03:32) [MSC v.1937 64 bit (AMD64)]
- **Embedded Python:** true
- **PyTorch Version:** 2.3.1+cu121
### Devices

- **Name:** cuda:0 NVIDIA GeForce RTX 3060 : cudaMallocAsync
  - **Type:** cuda
  - **VRAM Total:** 12884377600
  - **VRAM Free:** 5946926384
  - **Torch VRAM Total:** 5737807872
  - **Torch VRAM Free:** 11371824

### Logs

```
2024-09-19 15:53:01,239 - root - INFO - Total VRAM 12288 MB, total RAM 32689 MB
2024-09-19 15:53:01,240 - root - INFO - pytorch version: 2.3.1+cu121
2024-09-19 15:53:07,111 - root - INFO - xformers version: 0.0.26.post1
2024-09-19 15:53:07,111 - root - INFO - Set vram state to: NORMAL_VRAM
2024-09-19 15:53:07,111 - root - INFO - Device: cuda:0 NVIDIA GeForce RTX 3060 : cudaMallocAsync
2024-09-19 15:53:08,829 - root - INFO - Using xformers cross attention
2024-09-19 15:53:14,565 - root - INFO - [Prompt Server] web root: D:\AI\ComfyUI_windows_portable\ComfyUI\web
```

### Attached Workflow

{"last_node_id":15,"last_link_id":30,"nodes":[{"id":9,"type":"CogVideoDecode","pos":{"0":1280,"1":370},"size":{"0":315,"1":198},"flags":{},"order":8,"mode":0,"inputs":[{"name":"pipeline","type":"COGVIDEOPIPE","link":12},{"name":"samples","type":"LATENT","link":11}],"outputs":[{"name":"images","type":"IMAGE","links":[13],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CogVideoDecode"},"widgets_values":[true,96,96,0.083,0.083,true]},{"id":10,"type":"VHS_VideoCombine","pos":{"0":1760,"1":380},"size":[222.35000610351562,386.80735296731467],"flags":{},"order":9,"mode":0,"inputs":[{"name":"images","type":"IMAGE","link":13},{"name":"audio","type":"AUDIO","link":null},{"name":"meta_batch","type":"VHS_BatchManager","link":null},{"name":"vae","type":"VAE","link":null}],"outputs":[{"name":"Filenames","type":"VHS_FILENAMES","links":null,"shape":3}],"properties":{"Node name for S&R":"VHS_VideoCombine"},"widgets_values":{"frame_rate":8,"loop_count":0,"filename_prefix":"cogvideo","format":"image/gif","pingpong":false,"save_output":false,"videopreview":{"hidden":false,"paused":false,"params":{"filename":"cogvideo_00003.gif","subfolder":"","type":"temp","format":"image/gif","frame_rate":8},"muted":false}}},{"id":12,"type":"CogVideoTextEncode","pos":{"0":20,"1":860},"size":{"0":400,"1":200},"flags":{},"order":4,"mode":0,"inputs":[{"name":"clip","type":"CLIP","link":16}],"outputs":[{"name":"conditioning","type":"CONDITIONING","links":[17],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CogVideoTextEncode"},"widgets_values":["",1,true]},{"id":5,"type":"CLIPLoader","pos":{"0":-420,"1":560},"size":{"0":315,"1":82},"flags":{},"order":0,"mode":0,"inputs":[],"outputs":[{"name":"CLIP","type":"CLIP","links":[14,16],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CLIPLoader"},"widgets_values":["t5xxl_fp8_e4m3fn.safetensors","sd3"]},{"id":1,"type":"DownloadAndLoadCogVideoModel","pos":{"0":176,"1":272},"size":{"0":315,"1":154},"flags":{},"order":1,"mode":0,"inputs":[],"outputs":[{"name":"cogvideo_pipe","type":"COGVIDEOPIPE","links":[1,28],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"DownloadAndLoadCogVideoModel"},"widgets_values":["THUDM/CogVideoX-5b-I2V","bf16","enabled","disabled",false]},{"id":7,"type":"LoadImage","pos":{"0":-500,"1":1120},"size":{"0":315,"1":314},"flags":{},"order":2,"mode":0,"inputs":[],"outputs":[{"name":"IMAGE","type":"IMAGE","links":[23],"slot_index":0,"shape":3},{"name":"MASK","type":"MASK","links":null,"shape":3}],"properties":{"Node name for S&R":"LoadImage"},"widgets_values":["Doomsday_prof7.png","image"]},{"id":11,"type":"CogVideoTextEncode","pos":{"0":30,"1":580},"size":{"0":400,"1":200},"flags":{},"order":3,"mode":0,"inputs":[{"name":"clip","type":"CLIP","link":14}],"outputs":[{"name":"conditioning","type":"CONDITIONING","links":[15],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CogVideoTextEncode"},"widgets_values":["cartoon monster typing on a keyboard",1,true]},{"id":6,"type":"CogVideoImageEncode","pos":{"0":660,"1":1020},"size":{"0":315,"1":122},"flags":{},"order":6,"mode":0,"inputs":[{"name":"pipeline","type":"COGVIDEOPIPE","link":28},{"name":"image","type":"IMAGE","link":26},{"name":"mask","type":"MASK","link":null}],"outputs":[{"name":"samples","type":"LATENT","links":[30],"shape":3,"slot_index":0}],"properties":{"Node name for 
S&R":"CogVideoImageEncode"},"widgets_values":[16,true]},{"id":2,"type":"CogVideoSampler","pos":{"0":770,"1":350},"size":{"0":405.5999755859375,"1":378},"flags":{},"order":7,"mode":0,"inputs":[{"name":"pipeline","type":"COGVIDEOPIPE","link":1},{"name":"positive","type":"CONDITIONING","link":15},{"name":"negative","type":"CONDITIONING","link":17},{"name":"samples","type":"LATENT","link":null},{"name":"image_cond_latents","type":"LATENT","link":30}],"outputs":[{"name":"cogvideo_pipe","type":"COGVIDEOPIPE","links":[12],"slot_index":0,"shape":3},{"name":"samples","type":"LATENT","links":[11],"slot_index":1,"shape":3}],"properties":{"Node name for S&R":"CogVideoSampler"},"widgets_values":[480,720,48,10,6,271611582497891,"randomize","DDIM",16,8,1]},{"id":15,"type":"ImageResizeKJ","pos":{"0":51,"1":1190},"size":[315,242],"flags":{},"order":5,"mode":0,"inputs":[{"name":"image","type":"IMAGE","link":23},{"name":"get_image_size","type":"IMAGE","link":null},{"name":"width_input","type":"INT","link":null,"widget":{"name":"width_input"}},{"name":"height_input","type":"INT","link":null,"widget":{"name":"height_input"}}],"outputs":[{"name":"IMAGE","type":"IMAGE","links":[26],"shape":3,"slot_index":0},{"name":"width","type":"INT","links":null,"shape":3},{"name":"height","type":"INT","links":null,"shape":3}],"properties":{"Node name for S&R":"ImageResizeKJ"},"widgets_values":[720,480,"nearest-exact",false,16,0,0]}],"links":[[1,1,0,2,0,"COGVIDEOPIPE"],[11,2,1,9,1,"LATENT"],[12,2,0,9,0,"COGVIDEOPIPE"],[13,9,0,10,0,"IMAGE"],[14,5,0,11,0,"CLIP"],[15,11,0,2,1,"CONDITIONING"],[16,5,0,12,0,"CLIP"],[17,12,0,2,2,"CONDITIONING"],[23,7,0,15,0,"IMAGE"],[26,15,0,6,1,"IMAGE"],[28,1,0,6,0,"COGVIDEOPIPE"],[30,6,0,2,4,"LATENT"]],"groups":[],"config":{},"extra":{"ds":{"scale":0.6209213230591554,"offset":[458.29435413796347,-353.01835528580034]}},"version":0.4}

Gyramuur commented 2 months ago

Same here, unfortunately. Even without changing any settings, after one successful generation the second one throws the mentioned error.

kijai commented 2 months ago

I can't replicate this myself; I've tested a lot of different images at different aspect ratios and never gotten an error. Are you certain the output of the image resize node is exactly the same size?

henrique-galimberti commented 2 months ago

I get the same error every time after the first generation.

This only happens with the CogVideoX-5b I2V workflow. With the T2V or Fun workflows I don't get the error.

henrique-galimberti commented 2 months ago

@kijai not sure if this helps:

[two screenshots attached]

nieli123456 commented 2 months ago

Same error here.

Gyramuur commented 2 months ago

It should be noted that for me, the only setting I changed was enabling VAE tiling; without it I get OOM on my 24 GB of VRAM, so currently it's essential for me to use it.

kijai commented 2 months ago

> It should be noted that for me, the only setting I changed was enabling VAE tiling; without it I get OOM on my 24 GB of VRAM, so currently it's essential for me to use it.

Thanks, that helped me find the issue: once used, VAE tiling stayed enabled for the VAE as a whole, so the encoding (which uses the same VAE) also ran tiled and caused this mismatch. Should be fixed now.
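A minimal sketch of the failure mode described above, not the wrapper's actual code (the diffusers class and model path here are assumptions): tiling enabled for decoding persists on the shared VAE object, so the encode path has to set the tiling state explicitly rather than inherit whatever the last decode left behind.

```python
from diffusers import AutoencoderKLCogVideoX  # assumed class/path

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", subfolder="vae"
)

def decode_latents(latents, vae_tiling=True):
    if vae_tiling:
        vae.enable_tiling()  # sticky: stays enabled on the shared VAE
    return vae.decode(latents).sample

def encode_image(pixels):
    # The fix: set the tiling state explicitly before encoding instead of
    # inheriting it from the previous decode.
    vae.disable_tiling()
    return vae.encode(pixels).latent_dist.sample()
```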

genericgod commented 2 months ago

> Thanks, that helped me find the issue: once used, VAE tiling stayed enabled for the VAE as a whole, so the encoding (which uses the same VAE) also ran tiled and caused this mismatch. Should be fixed now.

@kijai Looks like that fixed it. Thank you. Great work!

4lt3r3go commented 1 month ago

Despite a thousand attempts, I managed to get it to work half of once... not even once, more like half a time, because it finally started rendering, but at some point it stopped and gave me the same error again, as always:

RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list

I have no idea how you guys solved it.

kijai commented 1 month ago

> Despite a thousand attempts, I managed to get it to work half of once... not even once, more like half a time, because it finally started rendering, but at some point it stopped and gave me the same error again, as always:
>
> RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list
>
> I have no idea how you guys solved it.

Are you saying the example workflow does not run for you as it is?

4lt3r3go commented 1 month ago

Tested everything again, including all the JSON files in the example folder on this GitHub. Here's the error I get with the cogvideox_2b_controlnet_example_01.json file:

error: The size of tensor a (3072) must match the size of tensor b (1920) at non-singleton dimension 2

There's also a workflow on CivitAI which gave me: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list

I know this is not related to this GitHub page, but it's really hard to find anything ControlNet/LoRA related that works for me. There's literally no one sharing workflows about this right now.

I think these two workflows have something in common, since both use ControlNet, but don't quote me on that 😁

kijai commented 1 month ago

> Tested everything again, including all the JSON files in the example folder on this GitHub. Here's the error I get with the cogvideox_2b_controlnet_example_01.json file:
>
> error: The size of tensor a (3072) must match the size of tensor b (1920) at non-singleton dimension 2
>
> There's also a workflow on CivitAI which gave me: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list
>
> I know this is not related to this GitHub page, but it's really hard to find anything ControlNet/LoRA related that works for me. There's literally no one sharing workflows about this right now.
>
> I think these two workflows have something in common, since both use ControlNet, but don't quote me on that 😁

That's probably because there are so many different models and they aren't compatible with each other. A LoRA trained for 2b won't work on 5b and vice versa, and they won't work between the Fun versions and the original models either.

Currently there's only a ControlNet for the original 2b text2vid model; it won't work with any other. What we can mix and match is very limited.
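The 3072-vs-1920 numbers are consistent with the hidden widths of the two transformers (the 5b model being 3072 wide and the 2b 1920 wide is an assumption here). A hypothetical sketch of why weights trained at one width can't be applied at the other:

```python
import torch

rank = 16
base_5b = torch.zeros(3072, 3072)   # a 5b-sized attention weight (assumed width)
lora_up = torch.zeros(1920, rank)   # LoRA factors trained against the 2b width
lora_down = torch.zeros(rank, 1920)

base_5b += lora_up @ lora_down      # delta has shape (1920, 1920)
# RuntimeError: The size of tensor a (3072) must match the size of
# tensor b (1920) at non-singleton dimension 1
```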

4lt3r3go commented 1 month ago

> Tested everything again, including all the JSON files in the example folder on this GitHub. Here's the error I get with the cogvideox_2b_controlnet_example_01.json file: error: The size of tensor a (3072) must match the size of tensor b (1920) at non-singleton dimension 2. There's also a workflow on CivitAI which gave me: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list. I know this is not related to this GitHub page, but it's really hard to find anything ControlNet/LoRA related that works for me. There's literally no one sharing workflows about this right now. I think these two workflows have something in common, since both use ControlNet, but don't quote me on that 😁
>
> That's probably because there are so many different models and they aren't compatible with each other. A LoRA trained for 2b won't work on 5b and vice versa, and they won't work between the Fun versions and the original models either.
>
> Currently there's only a ControlNet for the original 2b text2vid model; it won't work with any other. What we can mix and match is very limited.

You solved my problem just by mentioning the compatibility issue between 2b and 5b, as well as between Fun and the original models. I'm going to download 2b (which I wasn't considering until now; I always thought 5b was better) and see if it works.

I would also like to know which files all these CogVideo models have in common. Do I really need to download everything, or can I reuse files from another CogVideo model's folder? Like the two text encoder files [screenshot] and the VAE [screenshot]; both look quite similar to the ones in the other CogVideo model folders.

kijai commented 1 month ago

> Tested everything again, including all the JSON files in the example folder on this GitHub. Here's the error I get with the cogvideox_2b_controlnet_example_01.json file: error: The size of tensor a (3072) must match the size of tensor b (1920) at non-singleton dimension 2. There's also a workflow on CivitAI which gave me: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list. I know this is not related to this GitHub page, but it's really hard to find anything ControlNet/LoRA related that works for me. There's literally no one sharing workflows about this right now. I think these two workflows have something in common, since both use ControlNet, but don't quote me on that 😁
>
> That's probably because there are so many different models and they aren't compatible with each other. A LoRA trained for 2b won't work on 5b and vice versa, and they won't work between the Fun versions and the original models either. Currently there's only a ControlNet for the original 2b text2vid model; it won't work with any other. What we can mix and match is very limited.
>
> You solved my problem just by mentioning the compatibility issue between 2b and 5b, as well as between Fun and the original models. I'm going to download 2b (which I wasn't considering until now; I always thought 5b was better) and see if it works.
>
> I would also like to know which files all these CogVideo models have in common. Do I really need to download everything, or can I reuse files from another CogVideo model's folder? Like the two text encoder files [screenshot] and the VAE [screenshot]; both look quite similar to the ones in the other CogVideo model folders.

You don't need any of the text encoders or tokenizers, since we are using Comfy's T5 encoding; they also aren't downloaded by the autodownload node.

4lt3r3go commented 1 month ago

Thank you for your precious time; you saved me some extra HDD space.

Last question 🙄

Is it actually possible to do I2V + ControlNet driven by a video, or is it just one or the other?

The workflow I see shared on CivitAI has both options, but it's one or the other.

kijai commented 1 month ago

> Thank you for your precious time; you saved me some extra HDD space.
>
> Last question 🙄
>
> Is it actually possible to do I2V + ControlNet driven by a video, or is it just one or the other?
>
> The workflow I see shared on CivitAI has both options, but it's one or the other.

Currently it's not possible, as the only actual ControlNets are trained for the 2b text2video model.

The Fun versions' "pose" model also allows a control input, but it's a whole separate model and can't currently be used with image conditioning.

Both of these do allow images/video encoded as latents as additional input, but that's more like video2video than image2video.

4lt3r3go commented 1 month ago

> Thank you for your precious time; you saved me some extra HDD space. Last question 🙄 Is it actually possible to do I2V + ControlNet driven by a video, or is it just one or the other? The workflow I see shared on CivitAI has both options, but it's one or the other.
>
> Currently it's not possible, as the only actual ControlNets are trained for the 2b text2video model.
>
> The Fun versions' "pose" model also allows a control input, but it's a whole separate model and can't currently be used with image conditioning.
>
> Both of these do allow images/video encoded as latents as additional input, but that's more like video2video than image2video.

The whole thing is clearer now, on my fifth day of playing with all these CogVideo models. Are you referring to this? [screenshot] Why are these connected like this, and what is all this supposed to do? (By the way, this workflow is in the example folder on this GitHub.)

Or maybe this one? [screenshot]

kijai commented 1 month ago

> Thank you for your precious time; you saved me some extra HDD space. Last question 🙄 Is it actually possible to do I2V + ControlNet driven by a video, or is it just one or the other? The workflow I see shared on CivitAI has both options, but it's one or the other.
>
> Currently it's not possible, as the only actual ControlNets are trained for the 2b text2video model. The Fun versions' "pose" model also allows a control input, but it's a whole separate model and can't currently be used with image conditioning. Both of these do allow images/video encoded as latents as additional input, but that's more like video2video than image2video.
>
> The whole thing is clearer now, on my fifth day of playing with all these CogVideo models. Are you referring to this? [screenshot] Why are these connected like this, and what is all this supposed to do? (By the way, this workflow is in the example folder on this GitHub.)
>
> Or maybe this one? [screenshot]

That looks like a mismatch between the version of the nodes and the workflow...

Mario-Forero commented 3 weeks ago

The issue seems to be related to the order of width and height in the node.

Usually, other nodes take width first and then height, while "CogVideo Sampler" takes height first.

[screenshot]
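If so, this would also line up with the 60-vs-96 numbers seen earlier: CogVideoX's VAE downsamples 8x spatially, so swapped width/height values yield latents whose spatial dims no longer match the conditioning latents. A quick hypothetical check:

```python
# Hypothetical: 768x480 frames become 96x60 latents after the VAE's 8x
# spatial downsample, so feeding width into a height slot (or vice versa)
# reproduces the "Expected size 60 but got size 96" mismatch.
width, height = 768, 480
print(width // 8, height // 8)  # -> 96 60
```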