kijai / ComfyUI-CogVideoXWrapper


Less an issue - more banging a hammer at 1.5B #256

Closed KrakeyMTL closed 2 days ago

KrakeyMTL commented 2 days ago

Hi everyone, and of course Kijai!

1.5B is out, thanks for the update to the wrapper. Took a spell to fix the nodes man..so many changes but that is alright!

Thank you very much for the hard work.

While messing around I see we can reduce the FPS in the save node from 16 down to 12 or 10 or so, and it does in fact extend the video length by a second or two, but there has to be a better way. With the Img2Vid model, 1.5B gives me a hard limit of 3 sec on the best settings.

Anyone figure out a way to preserve the 16fps and get those coveted 10 second long videos??

Also the quality output of the 1.5B is vastly superior to anything so far - so good!

https://github.com/user-attachments/assets/75562e5f-7265-4dfb-8626-2cc6ed444bcb

https://github.com/user-attachments/assets/2b0872a3-9d26-4e07-ae1c-63ef005345b4

KrakeyMTL commented 2 days ago

These are the exact same settings as above, just with FPS in the saving node set to 12 (gives us a 4 sec video).

https://github.com/user-attachments/assets/8a754bde-ae64-48b9-b686-b30c17a096f1

KrakeyMTL commented 2 days ago

And last one to not spam up the thread too much -> this is using FPS 8 in the save box. This gives us 6 sec it seems.

https://github.com/user-attachments/assets/c37a3876-66c5-485c-b112-83d301a41e82


I'll be experimenting with a few things, but please, if anyone else has a tweak for the I2V to get those 10 sec vids, post it up.

Also, if anyone has a Text2Vid json for the 1.5, could you post it up please? I have not written my updated one yet. I'll make it later, but if someone already has one it would save some time. Thanks!

kijai commented 2 days ago

The FPS in the video combine is literally just the frame rate the video is played at. They have claimed the model can do 161 frames, which would be 10 seconds at 16fps. I find that 81 frames is a more reasonable target, which when played at 8fps would be 10 seconds long as well.
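For anyone following the math: playback length is just the frame count divided by the playback FPS. A minimal Python sketch of that arithmetic, assuming nothing about the wrapper itself (`clip_seconds` is a made-up helper name, and the 49-frame figure is an assumption chosen to match the roughly 3 second clips mentioned earlier, not a value stated in this thread):

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback duration in seconds: number of frames divided by playback FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Frame counts mentioned in the thread, at the playback rates discussed.
print(clip_seconds(161, 16))  # the claimed maximum, played at 16fps
print(clip_seconds(81, 8))    # the more modest target, played at 8fps

# Hypothetical 49-frame clip (assumed, see note above) at the three save rates tried.
for fps in (16, 12, 8):
    print(fps, clip_seconds(49, fps))
```

Lowering the FPS in the save node only stretches the same frames over more time. Generating more frames via num_frames is what actually adds footage.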

KrakeyMTL commented 2 days ago

@kijai I heard that too, so I just set up a test lol - great minds eh haha

The L40, however, has its CUDA cores maxed doing this and will take 39ish minutes omg... the results better be ace haha, I'll post them up when done. Thank you for the math you did, I'll try that next.

It's also interesting to know that you can squeeze it all into a 48GB VRAM card here with some overhead to spare... not much, but it's doable. I'll be loading this on my A6000 in the next few days to test. Who cares if it takes an hour when the rental cost is like 30 cents, you know? Thanks bud!

15B_16fps_161numframe test

zazoum-art commented 2 days ago

Let it take forever; it is doable. I have done 185 with a 4090. Offload everything.

KrakeyMTL commented 2 days ago

52min for rubbish. Some tweaks needed eh! lol

But yes, the key here, we find out, is num_frames in the end. This was 161.

https://github.com/user-attachments/assets/2ae9fc84-f32d-4aa3-95eb-54b06c3c8968

zazoum-art commented 2 days ago

What were your prompts?

KrakeyMTL commented 2 days ago

The image is from my own custom SDXL unpublished model fyi, but I was messing around with this only -> "one classic proportion burger, intricate, elegant, highly detailed, glossy haze, concept art, soft, sharp focus, bar background"

I wanted to know more about the length control before going nuts. I am now running an 81 num_frames test for fun and will post it up, but with what you all replied, this thread can pretty much be closed afterwards as info. We nailed down what is what here.

And yes, zazoum, I'm aware the prompt isn't complex enough to really work the model. I'll probably move to this prompt next: "Create a cinematic video showcasing a classic proportion burger. Begin with a close-up of glossy, fresh ingredients (lettuce, tomato, onion, cheese) falling gracefully in slow motion onto the patty, layered on a toasted bun. Show the burger being assembled with intricate, elegant detail, highlighting the textures and glossy surfaces. Add a soft haze in the background, transitioning to a sharp focus as the final burger is revealed. The setting is a warmly lit, sophisticated bar with soft ambient light, and gentle reflections dance across the counter. End with a 360-degree slow rotation of the completed burger, emphasizing its elegant design and intricate craftsmanship."

All the video models require dictionary-level prompting to really wring everything from them. I learned this while doing my ice rally car videos, phew... if there isn't enough description the model just fills in whatever hehe.

zazoum-art commented 2 days ago

Flux needs different prompting than sd1.5. Same here. There are prompt generators for cog.

Didn't read your whole post.

KrakeyMTL commented 2 days ago

why reply if you aren't going to read everything nor post a link to the cog generators?

Pretty rude/lame, I'll just ignore you now.

zazoum-art commented 2 days ago

Google is out there! Like the truth.

KrakeyMTL commented 2 days ago

@zazoum-art and with that you are just spamming being a chode adding nothing like the good little basement mouth breather you are.

Thread closed; info is above for anyone to ref.

Num_frames is key.

Thanks again Kijai!