Open bonlime opened 1 year ago
upd. after moving your codebase to the latest diffusers==0.18.1 + latest torch 2.0 I was able to generate videos on a T4 with 15 GB of VRAM
@bonlime What length and resolution are you getting on a T4? My current max is --L 10 --W 320 --H 384
Also, are you getting a Shutterstock watermark? I can't seem to get rid of it
@AI-Casanova I was able to generate with the default params L=16, W=H=512 on a T4, but I changed some code in the modules and also changed decode_latents to decode sequentially rather than decoding the whole batch at once, which was giving me OOM (sketch below). Here is a shitty sample I got (but this is vanilla SD 1.5)
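Roughly what I changed in decode_latents, as a sketch rather than a drop-in patch; it assumes latents shaped (b, c, f, h, w) and a standard diffusers AutoencoderKL on self.vae:

```python
import torch
from einops import rearrange

def decode_latents(self, latents):
    # latents: (b, c, f, h, w); decode one frame at a time to keep peak VRAM low
    video_length = latents.shape[2]
    latents = 1 / 0.18215 * latents
    latents = rearrange(latents, "b c f h w -> (b f) c h w")
    frames = []
    for i in range(latents.shape[0]):
        frame = self.vae.decode(latents[i:i + 1]).sample
        frames.append(frame.cpu())  # move decoded frames off the GPU right away
    video = torch.cat(frames, dim=0)
    video = rearrange(video, "(b f) c h w -> b c f h w", f=video_length)
    video = (video / 2 + 0.5).clamp(0, 1)
    return video
```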
Thanks for the kind reminder @bonlime :) We just found some tricks that help reduce memory costs a lot and will update our codebase asap.
btw, did you use xformers or other techniques to lower the memory?
@bonlime Ah, I see you have the watermark too. Might it be in the motion model?
I'd like to see the sequential decode changes if you don't mind.
@guoyww I forgot xformers at first, but even with it enabled I'm not able to generate at --L 16 --W 512 --H 512 on a Colab T4, likely due to the VAE decode.
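(For reference, the two knobs I'm toggling, where `pipe` is just a placeholder for the pipeline object built from this repo:)

```python
# assumes a diffusers-style pipeline object `pipe`
pipe.enable_xformers_memory_efficient_attention()  # memory-efficient attention via xformers
pipe.vae.enable_slicing()  # let the VAE decode its input in slices instead of one big batch
```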
I'm not using xformers, but I switched to torch 2.0 and use its efficient attention. @AI-Casanova I also see Shutterstock watermarks on all generated images and can't get rid of them
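What I mean by "its efficient attention", as a sketch; with diffusers >= 0.18 and torch >= 2.0 this processor is usually already the default, and `pipe` is again a placeholder pipeline object:

```python
from diffusers.models.attention_processor import AttnProcessor2_0

# route attention through torch.nn.functional.scaled_dot_product_attention
pipe.unet.set_attn_processor(AttnProcessor2_0())
pipe.vae.set_attn_processor(AttnProcessor2_0())
```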
That's correct, because every video in our training dataset, WebVid, has a watermark @AI-Casanova
Understood
Interesting. Why don't the example videos have watermarks then?
@0x1355 I get them in 9/10 examples. Maybe the authors cherry-picked seeds that don't produce watermarks
Thanks @bonlime. You are fast!
@guoyww How much compute was used to train each motion model? I couldn't find that in the paper or the repo. What's the minimum VRAM? Do you plan to share the training code too? It would be amazing.
We trained the module using 8 A100s for ~5 days, and the training scripts will be released in the future, thanks :)
I tried inference with 24 frames using the demo prompts but found that the quality degrades.
1 - ToonYou; Top: mm_sd_v14.ckpt; Bottom: mm_sd_v15.ckpt
2 - Lyriel; Top: mm_sd_v14.ckpt; Bottom: mm_sd_v15.ckpt
The Shutterstock watermark was absent in every one of my gens in @continue-revolution's extension until he "fixed" his code in version 1.2.0, where it's now in every gif. I've had to revert to version 1.1.0 of his extension. He rather unjustly closed my GitHub issue and linked here, but the reality is that if the old version was "broken", then "broken" is better.
Pretty sad to close a legit reproducible issue within 7 minutes and just link to a thread like this.
Hi! Thanks for a very interesting paper. I wonder if you've tried generating shorter/longer clips? I see that there is
`temporal_position_encoding_max_len=24`
which limits the length to at most 24 frames, but what about shorter clips?
Also, I'm struggling to understand the shape of the attention in the Temporal Transformer. Here you rearrange
`(b f) d c -> (b d) f c`
where the batch (b) is probably 1, the frames (f) are probably 16, and (d) corresponds to the reshaped spatial features, right? So each "super-pixel" is processed separately and the shape of the attention maps should be (B * D) x F x F, which isn't really big. Why, then, does inference take 60 GB?
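To make the shapes concrete, here is a tiny sketch of that rearrange with made-up but typical dimensions (batch 1, 16 frames, a 64x64 latent feature map with 320 channels); the numbers are illustrative, not taken from the repo:

```python
import torch
from einops import rearrange

b, f, d, c = 1, 16, 64 * 64, 320   # batch, frames, spatial tokens, channels (illustrative)
hidden = torch.randn(b * f, d, c)  # layout "(b f) d c"

# group all frames of each spatial location together so attention runs over the frame axis
hidden = rearrange(hidden, "(b f) d c -> (b d) f c", f=f)
print(hidden.shape)  # torch.Size([4096, 16, 320])

# per spatial token the attention map is only f x f = 16 x 16,
# which is why the temporal attention itself looks cheap on paper
```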