Closed · genericgod closed this 2 months ago
Same here unfortunately. Even without changing any settings, after successfully genning once, the second gen throws the mentioned error.
I can't replicate this myself; I've tested lots of different images at different aspect ratios and never gotten an error. Are you certain the output of the image resize node is exactly the same size?
I get the same error every time after the first generation.
Expected size 65 but got size 60 for tensor number 1 in the list
This only happens with the CogVideoX-5b-I2V workflow. With the T2V or Fun workflows I don't get the error.
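A note on what these numbers likely mean: the CogVideoX VAE downscales spatially by a factor of 8, so the sizes in these errors look like latent dimensions rather than pixels (60 latent rows = 480 px, 65 = 520 px). A minimal sanity check along those lines; the factor of 8 matches CogVideoX's VAE, everything else here is illustrative:

```python
# Both the conditioning image and the sampler's width/height must map to
# the same latent grid, or concatenating the tensors fails as above.
VAE_SPATIAL_SCALE = 8  # CogVideoX VAE downscales height/width by 8

def latent_hw(height_px: int, width_px: int) -> tuple[int, int]:
    # Pixel dims that aren't divisible by 8 get rounded somewhere downstream,
    # which is another way sizes can drift apart.
    return height_px // VAE_SPATIAL_SCALE, width_px // VAE_SPATIAL_SCALE

print(latent_hw(480, 720))  # (60, 90): what a 480x720 sampler expects
print(latent_hw(520, 720))  # (65, 90): a 520-px-tall input -> 65 vs 60 mismatch
```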
@kijai not sure if this helps:
Same error here.
It should be noted that for me, the only setting I changed was enabling VAE tiling, but without it I get OOM on my 24 GB of VRAM, so currently it's essential for me to use it.
Thanks, that helped me find the issue: once VAE tiling had been used, it stayed enabled on the VAE as a whole, so the encoding (which uses the same VAE) also ran tiled and caused this error. Should be fixed now.
@kijai Looks like that fixed it. Thank you. Great work!
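For anyone who hits a similar problem in their own pipeline code: diffusers' `AutoencoderKLCogVideoX` exposes `enable_tiling()` / `disable_tiling()`, and that flag is persistent state on the shared VAE object. A minimal sketch of the failure mode described above and one way to fix it (function and variable names here are illustrative, not the wrapper's actual code):

```python
# `vae` stands in for the pipeline's shared AutoencoderKLCogVideoX instance.

def decode_video(vae, latents, tiled=False):
    if tiled:
        vae.enable_tiling()  # sticky: stays enabled after this call returns
    return vae.decode(latents).sample

def encode_image(vae, pixels):
    # If decode_video(..., tiled=True) ran earlier, this encode now also
    # runs tiled and can yield latents of a slightly different size,
    # e.g. 65 rows instead of 60 -> the tensor-size error on the next sample.
    return vae.encode(pixels).latent_dist.sample()

def decode_video_fixed(vae, latents, tiled=False):
    # Fix: scope tiling to the single call instead of leaking it.
    if tiled:
        vae.enable_tiling()
    try:
        return vae.decode(latents).sample
    finally:
        vae.disable_tiling()  # restore the default for the next encode/decode
```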
Despite a thousand attempts, I managed to get it half-working once.. well, not even once, more like half a time: it finally started rendering, but at some point it stopped and gave me the same error as always:
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list
I have no idea how you guys solved it.
Are you saying the example workflow does not run for you as it is?
Tested everything again, including all the JSONs in the example folder on this GitHub. Here's the error I get with the cogvideox_2b_controlnet_example_01.json file:
error: The size of tensor a (3072) must match the size of tensor b (1920) at non-singleton dimension 2
There's also a workflow on CivitAI (link) which gave me:
Sizes of tensors must match except in dimension 2. Expected size 60 but got size 96 for tensor number 1 in the list
I know this is not related to this GitHub page, but it's really hard to find anything ControlNet/LoRA related that works for me. There's literally no one sharing workflows about this right now.
I think these two workflows have something in common, since both use ControlNet, but don't quote me on that 😁
That's probably because there are so many different models and they aren't compatible with each other. A LoRA trained for 2b won't work on 5b and vice versa, and they won't work between the Fun versions and the original model either.
Currently there's only a controlnet for the original 2b text2vid model; it won't work with any other. What we can mix and match is very limited.
You solved my problem just by mentioning the compatibility issue between 2b and 5b, as well as between Fun and the original. I'm going to download 2B (which I wasn't considering till now.. always thought 5b was better) and see if it works.
I would also like to know which files all these Cog models have in common. Do I really need to download everything, or can I use files from another Cog model's folder? The two text encoder files and the VAE both look pretty similar to the ones in the other Cog model folders.
You don't need any of the text encoders or tokenizers, as we are using the Comfy T5 encoding; those are also not downloaded by the autodownload node.
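For orientation, the CogVideoX repos on Hugging Face follow the standard diffusers layout, roughly like the sketch below; per the above, the text_encoder and tokenizer folders can be skipped entirely with this wrapper, while the transformer weights are the per-model part (treat the exact contents as something to verify against each repo):

```
THUDM/CogVideoX-5b-I2V/
├── model_index.json
├── scheduler/        # scheduler config
├── transformer/      # the video model itself, unique per variant
├── vae/              # the part that looks similar across the Cog repos
├── text_encoder/     # T5 weights: not needed with this wrapper
└── tokenizer/        # not needed with this wrapper
```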
Thank you for your precious time, you saved me extra HDD space.
Last question 🙄
Is it actually possible to do I2V + controlnet driven by a video? Or is it just one or the other?
The workflow I see shared on CivitAI has both options, but it's one or the other.
Currently it's not possible, as the only actual controlnets are trained for the 2b text2video model.
The Fun versions' "pose" model also allows control input, but it's a whole model and can't currently be used with image conditioning.
Both of these do allow images/video encoded as latents as additional input, but that's more like video2video than image2video.
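To make that distinction concrete: the CogVideoSampler node (see the workflow JSON at the end of this thread) has two separate latent inputs, and which one you feed determines the mode. Roughly, with input names taken from that JSON and the behavior paraphrased from the answer above:

```
VAE-encoded video -> CogVideoSampler.samples             # init latents: vid2vid-style guidance
VAE-encoded image -> CogVideoSampler.image_cond_latents  # first-frame conditioning: I2V proper
```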
The whole thing is clear now, on the 5th day of playing with all these Cog models. Are you referring to this? Why are these connected like this, and what is all this supposed to do? (btw this workflow is in the example folder on this GitHub)
Or maybe this one?
That looks like a mismatch between the version of the nodes and the workflow...
The issue seems to be related to the order of width and height in the node: usually other nodes take width first and then height, while "CogVideo Sampler" has height first.
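You can actually see this in the workflow JSON below: the ImageResizeKJ widgets begin [720, 480, ...] (width, height) while the CogVideoSampler widgets begin [480, 720, ...] (height, width). A quick, hedged way to check an exported workflow for such a swap; the widget positions are assumed from that JSON, so verify them against your node versions:

```python
import json

# Assumed widget order (from the workflow JSON in this thread):
#   ImageResizeKJ:   [width, height, ...]
#   CogVideoSampler: [height, width, ...]
with open("workflow.json") as f:
    nodes = {n["type"]: n["widgets_values"] for n in json.load(f)["nodes"]}

resize_w, resize_h = nodes["ImageResizeKJ"][:2]
sampler_h, sampler_w = nodes["CogVideoSampler"][:2]
if (resize_w, resize_h) != (sampler_w, sampler_h):
    print(f"mismatch: resize {resize_w}x{resize_h}, sampler {sampler_w}x{sampler_h}")
```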
I get this error when I swap in a different input image after one or more generations. I used an image resize node, so the input image should be the exact same size. Model: CogVideoX-5b-I2V. Here's the detailed Comfy error report:
ComfyUI Error Report
Error Details
Exception Message: Sizes of tensors must match except in dimension 1. Expected size 65 but got size 60 for tensor number 1 in the list.
Stack Trace
2024-09-19 15:53:01,239 - root - INFO - Total VRAM 12288 MB, total RAM 32689 MB
2024-09-19 15:53:01,240 - root - INFO - pytorch version: 2.3.1+cu121
2024-09-19 15:53:07,111 - root - INFO - xformers version: 0.0.26.post1
2024-09-19 15:53:07,111 - root - INFO - Set vram state to: NORMAL_VRAM
2024-09-19 15:53:07,111 - root - INFO - Device: cuda:0 NVIDIA GeForce RTX 3060 : cudaMallocAsync
2024-09-19 15:53:08,829 - root - INFO - Using xformers cross attention
2024-09-19 15:53:14,565 - root - INFO - [Prompt Server] web root: D:\AI\ComfyUI_windows_portable\ComfyUI\web
{"last_node_id":15,"last_link_id":30,"nodes":[{"id":9,"type":"CogVideoDecode","pos":{"0":1280,"1":370},"size":{"0":315,"1":198},"flags":{},"order":8,"mode":0,"inputs":[{"name":"pipeline","type":"COGVIDEOPIPE","link":12},{"name":"samples","type":"LATENT","link":11}],"outputs":[{"name":"images","type":"IMAGE","links":[13],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CogVideoDecode"},"widgets_values":[true,96,96,0.083,0.083,true]},{"id":10,"type":"VHS_VideoCombine","pos":{"0":1760,"1":380},"size":[222.35000610351562,386.80735296731467],"flags":{},"order":9,"mode":0,"inputs":[{"name":"images","type":"IMAGE","link":13},{"name":"audio","type":"AUDIO","link":null},{"name":"meta_batch","type":"VHS_BatchManager","link":null},{"name":"vae","type":"VAE","link":null}],"outputs":[{"name":"Filenames","type":"VHS_FILENAMES","links":null,"shape":3}],"properties":{"Node name for S&R":"VHS_VideoCombine"},"widgets_values":{"frame_rate":8,"loop_count":0,"filename_prefix":"cogvideo","format":"image/gif","pingpong":false,"save_output":false,"videopreview":{"hidden":false,"paused":false,"params":{"filename":"cogvideo_00003.gif","subfolder":"","type":"temp","format":"image/gif","frame_rate":8},"muted":false}}},{"id":12,"type":"CogVideoTextEncode","pos":{"0":20,"1":860},"size":{"0":400,"1":200},"flags":{},"order":4,"mode":0,"inputs":[{"name":"clip","type":"CLIP","link":16}],"outputs":[{"name":"conditioning","type":"CONDITIONING","links":[17],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CogVideoTextEncode"},"widgets_values":["",1,true]},{"id":5,"type":"CLIPLoader","pos":{"0":-420,"1":560},"size":{"0":315,"1":82},"flags":{},"order":0,"mode":0,"inputs":[],"outputs":[{"name":"CLIP","type":"CLIP","links":[14,16],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CLIPLoader"},"widgets_values":["t5xxl_fp8_e4m3fn.safetensors","sd3"]},{"id":1,"type":"DownloadAndLoadCogVideoModel","pos":{"0":176,"1":272},"size":{"0":315,"1":154},"flags":{},"order":1,"mode":0,"inputs":[],"outputs":[{"name":"cogvideo_pipe","type":"COGVIDEOPIPE","links":[1,28],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"DownloadAndLoadCogVideoModel"},"widgets_values":["THUDM/CogVideoX-5b-I2V","bf16","enabled","disabled",false]},{"id":7,"type":"LoadImage","pos":{"0":-500,"1":1120},"size":{"0":315,"1":314},"flags":{},"order":2,"mode":0,"inputs":[],"outputs":[{"name":"IMAGE","type":"IMAGE","links":[23],"slot_index":0,"shape":3},{"name":"MASK","type":"MASK","links":null,"shape":3}],"properties":{"Node name for S&R":"LoadImage"},"widgets_values":["Doomsday_prof7.png","image"]},{"id":11,"type":"CogVideoTextEncode","pos":{"0":30,"1":580},"size":{"0":400,"1":200},"flags":{},"order":3,"mode":0,"inputs":[{"name":"clip","type":"CLIP","link":14}],"outputs":[{"name":"conditioning","type":"CONDITIONING","links":[15],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"CogVideoTextEncode"},"widgets_values":["cartoon monster typing on a keyboard",1,true]},{"id":6,"type":"CogVideoImageEncode","pos":{"0":660,"1":1020},"size":{"0":315,"1":122},"flags":{},"order":6,"mode":0,"inputs":[{"name":"pipeline","type":"COGVIDEOPIPE","link":28},{"name":"image","type":"IMAGE","link":26},{"name":"mask","type":"MASK","link":null}],"outputs":[{"name":"samples","type":"LATENT","links":[30],"shape":3,"slot_index":0}],"properties":{"Node name for 
S&R":"CogVideoImageEncode"},"widgets_values":[16,true]},{"id":2,"type":"CogVideoSampler","pos":{"0":770,"1":350},"size":{"0":405.5999755859375,"1":378},"flags":{},"order":7,"mode":0,"inputs":[{"name":"pipeline","type":"COGVIDEOPIPE","link":1},{"name":"positive","type":"CONDITIONING","link":15},{"name":"negative","type":"CONDITIONING","link":17},{"name":"samples","type":"LATENT","link":null},{"name":"image_cond_latents","type":"LATENT","link":30}],"outputs":[{"name":"cogvideo_pipe","type":"COGVIDEOPIPE","links":[12],"slot_index":0,"shape":3},{"name":"samples","type":"LATENT","links":[11],"slot_index":1,"shape":3}],"properties":{"Node name for S&R":"CogVideoSampler"},"widgets_values":[480,720,48,10,6,271611582497891,"randomize","DDIM",16,8,1]},{"id":15,"type":"ImageResizeKJ","pos":{"0":51,"1":1190},"size":[315,242],"flags":{},"order":5,"mode":0,"inputs":[{"name":"image","type":"IMAGE","link":23},{"name":"get_image_size","type":"IMAGE","link":null},{"name":"width_input","type":"INT","link":null,"widget":{"name":"width_input"}},{"name":"height_input","type":"INT","link":null,"widget":{"name":"height_input"}}],"outputs":[{"name":"IMAGE","type":"IMAGE","links":[26],"shape":3,"slot_index":0},{"name":"width","type":"INT","links":null,"shape":3},{"name":"height","type":"INT","links":null,"shape":3}],"properties":{"Node name for S&R":"ImageResizeKJ"},"widgets_values":[720,480,"nearest-exact",false,16,0,0]}],"links":[[1,1,0,2,0,"COGVIDEOPIPE"],[11,2,1,9,1,"LATENT"],[12,2,0,9,0,"COGVIDEOPIPE"],[13,9,0,10,0,"IMAGE"],[14,5,0,11,0,"CLIP"],[15,11,0,2,1,"CONDITIONING"],[16,5,0,12,0,"CLIP"],[17,12,0,2,2,"CONDITIONING"],[23,7,0,15,0,"IMAGE"],[26,15,0,6,1,"IMAGE"],[28,1,0,6,0,"COGVIDEOPIPE"],[30,6,0,2,4,"LATENT"]],"groups":[],"config":{},"extra":{"ds":{"scale":0.6209213230591554,"offset":[458.29435413796347,-353.01835528580034]}},"version":0.4}