tumurzakov / AnimateDiff

AnimateDiff with train
Apache License 2.0

Making the frame size bigger #4

Open · Don-Chad opened this issue 1 year ago

Don-Chad commented 1 year ago

It works!

Thanks for sharing this.

Any idea how we could change the video length to something like 32 or 48? Longer motion would be great. At the moment it seems to be capped at 24.

It would be fine to start over instead of using the existing motion dataset.

The error I am getting now is:

File "g:\content\animatediff\animatediff\models\motion_module.py", line 244, in forward x = x + self.pe[:, :x.size(1)] RuntimeError: The size of tensor a (32) must match the size of tensor b (24) at non-singleton dimension 1

tumurzakov commented 1 year ago

I cherry-picked an awesome idea from https://github.com/dajes/AnimateDiff. It is in the devel branch; I'm still working on it.

PR: https://github.com/guoyww/AnimateDiff/pull/25

Changing the pe size requires retraining the model. That is too expensive for me.
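If the cherry-picked idea is the sliding temporal context from that fork (the temporal_context key shows up in the configs further down), the core of it is running the fixed-length motion module over overlapping windows of frame indices and blending the overlaps. A rough sketch of just the window generation, under that assumption (not the actual devel-branch code):

def temporal_windows(video_length, temporal_context=24, overlap=8):
    # Hypothetical helper: yield overlapping [start, end) frame windows that
    # cover the whole clip, so a fixed-length module can process longer videos.
    stride = temporal_context - overlap
    start = 0
    while start < video_length:
        end = min(start + temporal_context, video_length)
        yield list(range(start, end))
        if end == video_length:
            break
        start += stride

# A 48-frame clip covered by a 24-frame module:
for window in temporal_windows(48):
    print(window[0], "...", window[-1])  # 0...23, 16...39, 32...47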

Don-Chad commented 1 year ago

Yes, this combination would be a perfect approach! I would be happy to run new trainings and provide the GPU power for it. We could also start with smaller models.

Would you be able to make a model which does 52 motion frames? It would be very dope to have longer videos! @tumurzakov

tumurzakov commented 1 year ago

@Don-Chad I increased it to 48 (24*2) by doubling the pe tensors from the original module and training for 1000 steps. It works well, and it is better than training from scratch.

The main problem is not GPU power but the dataset.
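For anyone wanting to reproduce the doubling: judging by the pos_encoder.pe keys in the checkpoint, the trick is to tile each pe buffer along the frame axis before loading, then fine-tune. A sketch of that idea (not the exact motion_module_pe_multiplier code from the devel branch):

import torch

def multiply_pe(motion_module_state_dict, multiplier=2):
    # Repeat every positional-encoding buffer along the frame dimension,
    # e.g. (1, 24, 320) -> (1, 48, 320) for multiplier=2.
    out = {}
    for name, tensor in motion_module_state_dict.items():
        if "pos_encoder.pe" in name:
            tensor = tensor.repeat(1, multiplier, 1)
        out[name] = tensor
    return out

state_dict = torch.load("models/Motion_Module/mm_sd_v15.ckpt", map_location="cpu")
state_dict = multiply_pe(state_dict, multiplier=2)
# load into a UNet built for 48 frames, then fine-tune for ~1000 steps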

Don-Chad commented 1 year ago

Wow! Would you please share the pipeline_animation with the doubling? (Sorry, I cannot find how to do this.)

I would love to work on the dataset. I have a lot of good varied content with labels. Happy to share a new motion module.

tumurzakov commented 1 year ago

@Don-Chad It's very simple. The code is in the devel branch.

tumurzakov commented 1 year ago

I trained 96 frames on an A100 for 1000 steps (20 minutes). It took 21 GB of VRAM. It seems that up to 184 frames could be trained on an A100. Inference on the A100 took 20 GB of VRAM. [attachment: 96frames-1000]

But at that frame count there could be problems with the pe. In AnimateDiff the pe comes from an NLP transformer. Possibly we could try ViT positional encodings there to encode longer videos.
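One concrete option in that direction (just a sketch of the kind of thing I mean, nothing tested here) would be ViT-style learned position embeddings, which can be interpolated to a new frame count instead of tiled:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedPositionalEncoding(nn.Module):
    # ViT-style learned embedding table over the frame axis.
    def __init__(self, d_model, max_len=24):
        super().__init__()
        self.pe = nn.Parameter(torch.zeros(1, max_len, d_model))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

def interpolate_pe(pe, new_len):
    # Resize a (1, old_len, d_model) table to (1, new_len, d_model),
    # the same trick ViTs use when the input resolution changes.
    pe = pe.permute(0, 2, 1)                              # (1, d_model, old_len)
    pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=False)
    return pe.permute(0, 2, 1)                            # (1, new_len, d_model)

print(interpolate_pe(torch.randn(1, 24, 320), 96).shape)  # torch.Size([1, 96, 320])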

Just for fun, 48 frames on the 96-frame model: [attachment: 48on96model]

Don-Chad commented 1 year ago

Thanks kindly for sharing! Just one line makes a difference :-)

Good to see it works. Let me give it a try.

Don-Chad commented 1 year ago

@tumurzakov What difference do you think ViT positional encodings would make for the pe here?

ezra-ch commented 1 year ago

I can't seem to use the motion_module_pe_multiplier feature. Here is my config:

motion_module: models\Motion_Module\mmv1.5.pth
output_dir: models\Motion_Module\fff2
train_data:
  video_path: data/fff2.mp4
  prompt: girl
  n_sample_frames: 48
  width: 512
  height: 512
  sample_start_idx: 0
  sample_frame_rate: 1 # sampler rate (how many frames it skips; e.g. sample_frame_rate 4 advances the loop by 4 frames each step)
validation_data:
  prompts:
  - girl 
  video_length: 48
  temporal_context: 200
  width: 512
  height: 512
  num_inference_steps: 20
  guidance_scale: 5
  use_inv_latent: true
  num_inv_steps: 40
learning_rate: 3.0e-05
train_batch_size: 1
max_train_steps: 1000
checkpointing_steps: 100
validation_steps: 100
train_whole_module: false
trainable_modules:
- to_q
seed: 34
mixed_precision: fp16
use_8bit_adam: false
gradient_checkpointing: true
enable_xformers_memory_efficient_attention: true
motion_module_pe_multiplier: 2
  File "G:\tuneavid\AnimateDiff\train.py", line 417, in <module>
    main(**OmegaConf.load(args.config))
  File "G:\tuneavid\AnimateDiff\train.py", line 133, in main
    missing, unexpected = unet.load_state_dict(motion_module_state_dict, strict=False)
  File "G:\anaconda3\envs\tuneavid\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet3DConditionModel:
        size mismatch for down_blocks.0.motion_modules.0.temporal_transformer.transformer_blocks.0.attention_blocks.0.pos_encoder.pe: copying a param with shape torch.Size([1, 48, 320]) from checkpoint, the shape in current model is torch.Size([1, 24, 320]).
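A note on that traceback: strict=False only relaxes missing and unexpected keys; PyTorch still raises on shape mismatches, so the multiplied checkpoint pe (48 frames) has to meet a UNet whose pos_encoder was also built for 48. A small diagnostic along these lines (names assumed, adapt to train.py) shows which side is out of step:

def compare_pe_shapes(unet, motion_module_state_dict):
    # Print checkpoint vs. model shape for every pos_encoder.pe buffer.
    model_state = unet.state_dict()
    for name, ckpt_tensor in motion_module_state_dict.items():
        if "pos_encoder.pe" in name and name in model_state:
            print(name,
                  "checkpoint:", tuple(ckpt_tensor.shape),
                  "model:", tuple(model_state[name].shape))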
tumurzakov commented 1 year ago

Here is my config for 264 frames:

pretrained_model_path: /content/animatediff/models/StableDiffusion/
motion_module: /content/animatediff/models/Motion_Module/mm_sd_v15.ckpt
motion_module_pe_multiplier: 11
inference_config_path: /content/drive/MyDrive/AI/video/videos/couplet2/train-full-256/valid.yaml
start_global_step: 0
output_dir: /content/drive/MyDrive/AI/video/videos/couplet2/train-full-256
dataset_class: FramesDataset
train_data:
  samples_dir: /content/drive/MyDrive/AI/video/videos/couplet2/dataset256
  prompt_map_path: /content/drive/MyDrive/AI/video/videos/couplet2/prompt_map.json
  video_length: 264
  width: 480
  height: 272
validation_data:
  prompts:
  - standing face girl
  video_length: 264
  width: 480
  height: 272
  temporal_context: 264
  num_inference_steps: 10
  guidance_scale: 12.5
  use_inv_latent: true
  num_inv_steps: 50
learning_rate: 3.0e-05
train_batch_size: 1
max_train_steps: 2000
checkpointing_steps: 100
validation_steps: 10000
train_whole_module: true
trainable_modules:
- to_q
seed: 33
mixed_precision: fp16
use_8bit_adam: false
gradient_checkpointing: true
enable_xformers_memory_efficient_attention: true

Take a look at the train_data section:

train_data:
  samples_dir: /content/drive/MyDrive/AI/video/videos/couplet2/dataset256
  prompt_map_path: /content/drive/MyDrive/AI/video/videos/couplet2/prompt_map.json
  video_length: 264 <---- this is the key that was missed
  width: 480
  height: 272