Open bonlime opened 1 year ago
upd. after moving your codebase to the latest diffusers==0.18.1 + latest torch 2.0 I was able to generate videos on a T4 with 15 GB of VRAM
@bonlime What length and resolution are you getting on a T4? My current max is --L 10 --W 320 --H 384
Also, are you getting a Shutterstock watermark? I can't seem to get rid of it
@AI-Casanova I was able to generate with the default params L=16, W=H=512 on a T4, but I changed some code in the modules and also changed decode_latents to decode sequentially rather than decoding the whole batch at once, which was giving me OOM (sketch below). Here is a shitty sample I got (but this is vanilla SD 1.5)
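Roughly what I changed in decode_latents, as a sketch rather than a drop-in patch; it assumes latents shaped (b, c, f, h, w) and a standard diffusers AutoencoderKL on self.vae:

```python
import torch
from einops import rearrange

def decode_latents(self, latents):
    # latents: (b, c, f, h, w); decode one frame at a time to keep peak VRAM low
    video_length = latents.shape[2]
    latents = 1 / 0.18215 * latents
    latents = rearrange(latents, "b c f h w -> (b f) c h w")
    frames = []
    for i in range(latents.shape[0]):
        frame = self.vae.decode(latents[i:i + 1]).sample
        frames.append(frame.cpu())  # move decoded frames off the GPU right away
    video = torch.cat(frames, dim=0)
    video = rearrange(video, "(b f) c h w -> b c f h w", f=video_length)
    video = (video / 2 + 0.5).clamp(0, 1)
    return video
```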
Thanks for the kind reminder @bonlime :) We just found some tricks that help reduce memory costs a lot and will update our codebase asap.
btw, did you use xformers or other techniques to lower the memory?
@bonlime Ah, I see you have the watermark too. Might it be in the motion model?
I'd like to see the sequential decode changes if you don't mind.
@guoyww I forgot xformers at first, but even with it enabled I'm not able to generate at --L 16 --W 512 --H 512 on a Colab T4, likely due to the VAE decode.
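(For reference, the two knobs I'm toggling, where `pipe` is just a placeholder for the pipeline object built from this repo:)

```python
# assumes a diffusers-style pipeline object `pipe`
pipe.enable_xformers_memory_efficient_attention()  # memory-efficient attention via xformers
pipe.vae.enable_slicing()  # let the VAE decode its input in slices instead of one big batch
```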
I'm not using xformers, but I switched to torch 2.0 and use its efficient attention. @AI-Casanova I also see Shutterstock watermarks on all generated images and can't get rid of them
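What I mean by "its efficient attention", as a sketch; with diffusers >= 0.18 and torch >= 2.0 this processor is usually already the default, and `pipe` is again a placeholder pipeline object:

```python
from diffusers.models.attention_processor import AttnProcessor2_0

# route attention through torch.nn.functional.scaled_dot_product_attention
pipe.unet.set_attn_processor(AttnProcessor2_0())
pipe.vae.set_attn_processor(AttnProcessor2_0())
```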
That's correct, because every video in our training dataset, WebVid, has a watermark @AI-Casanova
Understood
Interesting. Why don't the example videos have watermarks then?
@0x1355 I get them in 9/10 examples. Maybe the authors cherry-picked seeds that don't produce watermarks
Thanks @bonlime. You are fast!
@guoyww How much compute was used to train each motion model? I couldn't find that in the paper or the repo. What's the minimum VRAM? Do you plan to share the training code too? It would be amazing.
We trained the module using 8 A100s for ~5 days, and the training scripts will be released in the future, thanks :)
I tried inference with 24 frames using the demo prompts but found that the quality degrades.
1 - ToonYou; Top: mm_sd_v14.ckpt; Bottom: mm_sd_v15.ckpt
2 - Lyriel; Top: mm_sd_v14.ckpt; Bottom: mm_sd_v15.ckpt
The Shutterstock watermark was absent in every one of my gens in @continue-revolution's extension until he "fixed" his code in version 1.2.0, where it's now in every gif. I've had to revert to version 1.1.0 of his extension. He rather unjustly closed my GitHub issue and linked here, but the reality is that if the old version was "broken", then "broken" is better.
Pretty sad to close a legit reproducible issue within 7 minutes and just link to a thread like this.
Hi! Thanks for a very interesting paper. I wonder if you've tried generating shorter/longer clips? I see that there is
`temporal_position_encoding_max_len=24`
which limits the length to at most 24 frames, but what about shorter clips?
Also, I'm struggling to understand the shape of the attention in the Temporal Transformer. Here you rearrange
`(b f) d c -> (b d) f c`
where the batch (b) is probably 1, the frames (f) are probably 16, and (d) corresponds to the reshaped spatial features, right? So each "super-pixel" is processed separately and the shape of the attention maps should be (B * D) x F x F, which isn't really big. Why, then, does inference take 60 GB?
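To make the shapes concrete, here is a tiny sketch of that rearrange with made-up but typical dimensions (batch 1, 16 frames, a 64x64 latent feature map with 320 channels); the numbers are illustrative, not taken from the repo:

```python
import torch
from einops import rearrange

b, f, d, c = 1, 16, 64 * 64, 320   # batch, frames, spatial tokens, channels (illustrative)
hidden = torch.randn(b * f, d, c)  # layout "(b f) d c"

# group all frames of each spatial location together so attention runs over the frame axis
hidden = rearrange(hidden, "(b f) d c -> (b d) f c", f=f)
print(hidden.shape)  # torch.Size([4096, 16, 320])

# per spatial token the attention map is only f x f = 16 x 16,
# which is why the temporal attention itself looks cheap on paper
```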