tumurzakov / AnimateDiff

AnimateDiff with training
Apache License 2.0

training issues #7

Open Cubey42 opened 1 year ago

Cubey42 commented 1 year ago

Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'prediction_type', 'variance_type'} was not found in config. Values will be initialized to default values.
loaded temporal unet's pretrained weights from I:\animetest\anime\animatediff\models\StableDiffusion\unet ...
{'num_class_embeds', 'use_linear_projection', 'upcast_attention', 'mid_block_type', 'only_cross_attention', 'dual_cross_attention', 'class_embed_type', 'resnet_time_scale_shift'} was not found in config. Values will be initialized to default values.
### missing keys: 560;
### unexpected keys: 0;
### Temporal Module Parameters: 417.1376 M
{'prediction_type'} was not found in config. Values will be initialized to default values.
07/31/2023 19:11:45 - INFO - __main__ - ***** Running training *****
07/31/2023 19:11:45 - INFO - __main__ -   Num examples = 1
07/31/2023 19:11:45 - INFO - __main__ -   Num Epochs = 1
07/31/2023 19:11:45 - INFO - __main__ -   Instantaneous batch size per device = 1
07/31/2023 19:11:45 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
07/31/2023 19:11:45 - INFO - __main__ -   Gradient Accumulation steps = 1
07/31/2023 19:11:45 - INFO - __main__ -   Total optimization steps = 1
Steps:   0%|                                                                                     | 0/1 [00:00<?, ?it/s]I:\animetest\anime\venv\lib\site-packages\torch\utils\checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
I:\animetest\anime\venv\lib\site-packages\xformers\ops\fmha\flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [00:23<00:00,  2.17it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 44.81it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 47.02it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 49.18it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 47.74it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.09it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 48.82it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.10it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 47.53it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.09it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 48.84it/s]

It seems to just keep producing samples to infinity...
Cubey42 commented 1 year ago

I've tried a couple of different methods, schedulers, and learning rates, but nothing seems to have any noticeable impact, to the point where I'm not entirely sure it's training the motion module at all. Do you have any examples of successful fine-tuning?

tumurzakov commented 1 year ago

Yes.

[attached sample GIF: 0 (2)]

Trained 1000 steps on 2 videos with temporal_position_encoding_max_len=32 and it learned the motion. Not perfect, but it works.

patrolli commented 1 year ago

Yes.

[attached sample GIF: 0 (2)]

Trained 1000 steps on 2 videos with temporal_position_encoding_max_len=32 and it learned the motion. Not perfect, but it works.

Hi, did you tune just the 'to_q' params here, or the whole motion module?

tumurzakov commented 1 year ago

Whole module

patrolli commented 1 year ago

Thanks for your reply. I tried to fine-tune the motion module following your training code, extending it to train on my video dataset instead of a single video. I found that the results of the fine-tuned model kept getting worse, until it eventually produced unreasonable clips. I would like to ask whether you used this code for fine-tuning, and how you set hyperparameters such as the learning rate?

tumurzakov commented 1 year ago

I was wrong. to_q only @patrolli

tumurzakov commented 1 year ago

I'm working on the problem of stylizing one video and deliberately overfitting the module on one motion, so we are solving different problems. I'm still experimenting.

patrolli commented 1 year ago

Copy. Tuning the entire motion module always produces worse results, which significantly harms the original performance. I have also tried tuning only 'to_q', but its results seem to be only slightly better than tuning the entire module :(

Cubey42 commented 1 year ago

Yes.

[attached sample GIF: 0 (2)]

Trained 1000 steps on 2 videos with temporal_position_encoding_max_len=32 and it learned the motion. Not perfect, but it works.

Interesting, could I copy your .yaml for testing?

tumurzakov commented 1 year ago

Sure, I'll attach it tomorrow.

aartykov commented 1 year ago

Hey @tumurzakov, should the number of frames be the same for all videos in the training dataset?

tumurzakov commented 1 year ago

No, the dataset loader will grab as many frames as it needs. The main requirement is that the dataset videos must contain MORE frames than video_length.
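
A minimal sketch of what such a loader typically does (decord as the video reader and the exact parameter names are assumptions drawn from the config later in this thread, not necessarily the repository's implementation):

```python
import decord  # assumed video-reading backend; the repo may use a different one

def sample_clip(video_path, video_length=16, sample_start_idx=0, sample_frame_rate=1):
    """Grab exactly video_length frames, stepping through the source by sample_frame_rate."""
    reader = decord.VideoReader(video_path)
    indices = [sample_start_idx + i * sample_frame_rate for i in range(video_length)]
    if indices[-1] >= len(reader):
        raise ValueError("video must contain more frames than video_length")
    return reader.get_batch(indices).asnumpy()  # shape: (video_length, H, W, 3)
```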

tumurzakov commented 1 year ago

@Cubey42 nothing special. These are the configs used for training the 96-frame model:

pretrained_model_path: /content/animatediff/models/StableDiffusion/
motion_module: /content/drive/MyDrive/AI/video/videos/intro2/train/mm-100.pth
motion_module_pe_multiplier: 1
inference_config_path: /content/drive/MyDrive/AI/video/videos/intro2/infer/valid.yaml
start_global_step: 0
output_dir: /content/drive/MyDrive/AI/video/videos/intro2/train
train_data:
  video_path:
  - /content/drive/MyDrive/AI/video/videos/intro2/dataset/0.mp4
  - /content/drive/MyDrive/AI/video/videos/intro2/dataset/1.mp4
  prompt:
  - fly over mist
  - fly over mist
  n_sample_frames: 96
  width: 480
  height: 272
  sample_start_idx: 0
  sample_frame_rate: 1
validation_data:
  prompts:
  - fly over mist
  - fly over mist
  video_length: 96
  width: 480
  height: 272
  temporal_context: 96
  num_inference_steps: 20
  guidance_scale: 12.5
  use_inv_latent: true
  num_inv_steps: 50
learning_rate: 3.0e-05
train_batch_size: 1
max_train_steps: 1000
checkpointing_steps: 100
validation_steps: 10000
trainable_modules:
- to_q
seed: 33
mixed_precision: fp16
use_8bit_adam: false
gradient_checkpointing: true
enable_xformers_memory_efficient_attention: true
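
For reference, a quick way to sanity-check such a config before launching training (assuming it is consumed via OmegaConf, as in similar Tune-A-Video-style trainers; the filename is hypothetical):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("train_intro2.yaml")  # hypothetical filename for the config above
assert cfg.train_data.n_sample_frames == cfg.validation_data.video_length  # both 96 here
print(cfg.trainable_modules)                   # ['to_q']
print(cfg.learning_rate, cfg.max_train_steps)  # 3e-05 1000
```
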
aartykov commented 1 year ago

What is this param for? n_sample_frames: 96

tumurzakov commented 1 year ago

@aartykov It's the position encoding size. The motion module was trained for 24 frames, so it can't generate more than 24 frames at once. I increased it to 96 and fine-tuned it. Look at the other issue, #4.
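
For intuition, a minimal sketch of the standard sinusoidal table that a parameter like temporal_position_encoding_max_len sizes (an illustration using the usual Transformer formula, not the repository's exact module; 320 channels is an assumed width):

```python
import math
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional-encoding table of shape (max_len, d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The stock motion module only has positions for 24 frames; rebuilding the table with
# max_len=96 and then fine-tuning is what allows longer clips to be generated at once.
pe_long = sinusoidal_pe(96, 320)
```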

Cubey42 commented 1 year ago

No, the dataset loader will grab as many frames as it needs. The main requirement is that the dataset videos must contain MORE frames than video_length.

Just to confirm, the dataset loader is just grabbing the frames it needs, correct? Or do longer videos with more frames give it more data? (Do we want short 16-frame videos only for 16-frame training, or is there a benefit to going to 64 frames, etc.?)

tumurzakov commented 1 year ago

The loader grabs only video_length frames from each video. If you need more frames from one video, load that video multiple times with a different starting index, as sketched below.
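
A hedged sketch of what "load this video multiple times with a different starting index" could look like when building the training list (the (path, start) pairing is illustrative; the actual dataset format may differ):

```python
# Cover a long source clip by registering it several times with staggered start frames.
def expand_video(video_path, total_frames, video_length=16, stride_between_clips=16):
    """Return window entries; use stride_between_clips < video_length for overlapping windows."""
    entries, start = [], 0
    while start + video_length <= total_frames:
        entries.append({"video_path": video_path, "sample_start_idx": start})
        start += stride_between_clips
    return entries

windows = expand_video("/content/drive/MyDrive/AI/video/videos/intro2/dataset/0.mp4", total_frames=96)
# -> six 16-frame windows starting at 0, 16, 32, 48, 64, 80
```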

Cubey42 commented 1 year ago

Okay, I thought so. I'm having slightly better results with your config, so I will probably experiment some more. I've been trying to train a 512x768 video @ 16 frames; do you think I should lower it to fit within 512 for better results (like 256x512 instead of 512x768)?

tumurzakov commented 1 year ago

Try 512x512, because that is the size the UNet was trained on. Also, train motions: for example, for a video of Jordan with a ball, train it as "man is dribbling"; if somebody is walking, train it as "walking". If you need to train some rare motion, use a rare token (sks, for example).

Don-Chad commented 1 year ago

@tumurzakov Thanks for sharing again :-) I noticed you are training with two videos (0.mp4 and 1.mp4). Are you training two videos with the same prompt on purpose (I guess to get variety)?

aartykov commented 1 year ago

I have a tiny problem. My dataset consists of small video clips, each with a minimum of 16 frames. After resizing the frames, I convert them to 4 fps clips with a stride of 4. However, when I play the mp4 files, the video passes so fast that I can't even recognize the frames. Do you have any idea why? @tumurzakov

https://github.com/tumurzakov/AnimateDiff/assets/18645902/b8c4b33e-8550-44c7-a9ec-8fc769db4c62

tumurzakov commented 1 year ago

@Don-Chad I need a cyberpunk video of flying over mist with skyscrapers, something like the opening scene of Blade Runner. For my own purposes.

Cubey42 commented 1 year ago

Maybe it's just not good with character motion? I've tried different labels and added more videos, but it doesn't really seem to have an impact. The couple of times I did create samples, they seemed good, but once it's exported as a .pth I just don't see any of that.

tumurzakov commented 1 year ago

@Cubey42 try to train the whole module, not the to_q layers only. Take a look here; a minor change is needed.

I trained the whole module on the skss token for 1000 steps and it just reconstructs the video sample that I used.

Don-Chad commented 1 year ago

@tumurzakov

Sure, happy to help. I have a first selection in a drive folder. Can you send me an email? Then I can share it -> mark at dopamine.amsterdam

Cubey42 commented 1 year ago

@Cubey42 try to train the whole module, not the to_q layers only. Take a look here; a minor change is needed.

I trained the whole module on the skss token for 1000 steps and it just reconstructs the video sample that I used.

So is that change all I need to do, or should I also remove to_q from the config?

tumurzakov commented 1 year ago

@Cubey42 change the line

if "motion_modules" in name and name.endswith(tuple(trainable_modules)):

to

if "motion_modules" in name:

It will train the whole module.
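
In context, that line sits in a parameter-selection loop roughly like the following (a sketch with assumed names `unet` and `trainable_modules`, not the repository's verbatim code):

```python
# Freeze everything, then re-enable gradients only for the chosen parameters.
unet.requires_grad_(False)  # `unet` is the loaded 3D UNet with motion modules (assumed)
for name, param in unet.named_parameters():
    # Original behaviour: train only the listed layers (e.g. to_q) inside the motion modules:
    # if "motion_modules" in name and name.endswith(tuple(trainable_modules)):
    # Modified behaviour: train every motion-module parameter:
    if "motion_modules" in name:
        param.requires_grad = True
```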

Cubey42 commented 1 year ago

okay I'll try this, thanks!

Cubey42 commented 1 year ago

After some more testing with this, I'm noticing an improvement in composition, but now it has me thinking... it seems like the frame rate of the data is either too fast or too slow. There seems to be some minor motion present, but if I want it to be faster, should I increase sample_frame_rate?

aartykov commented 1 year ago

Hey! I want my model to learn cartoon-style motion, so I prepared small video clips from cartoon videos. Do you suggest training the whole motion module with the entire dataset, or just with a few clips? @tumurzakov

aartykov commented 1 year ago

Could you also share your training loss graph?

Cubey42 commented 1 year ago

@tumurzakov I've had more success training the whole module, thank you. I have a large dataset that was already built for Text-To-Video-Finetuning; if possible, can I use a different dataset loader (VideoJsonDataset)?

patrolli commented 1 year ago

Hi, could you share your generated examples after training the whole module? Many thanks.

aartykov commented 1 year ago

What training method did you use here, LoRA or DreamBooth? @tumurzakov

tumurzakov commented 1 year ago

@patrolli https://arxiv.org/pdf/2307.04725.pdf

During experiments, we discovered that using a diffusion schedule slightly different from the original schedule where the base T2I model was trained helps achieve better visual quality and avoid artifacts such as low saturability and flickering. We hypothesize that slightly modifying the original schedule can help the model better adapt to new tasks (animation) and new data distribution. Thus, we used a linear beta schedule, where β_start = 0.00085 and β_end = 0.012, which is slightly different from that used to train the original SD.

Change the file /content/animatediff/models/StableDiffusion/scheduler/scheduler_config.json to:

{
  "_class_name": "PNDMScheduler",
  "_diffusers_version": "0.6.0",
  "beta_end": 0.012,
  "beta_schedule": "linear",
  "beta_start": 0.00085,
  "num_train_timesteps": 1000,
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "steps_offset": 1,
  "trained_betas": null,
  "clip_sample": false
}
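
Equivalently (a hedged alternative to editing the JSON by hand), the same linear beta schedule can be passed when constructing the scheduler with diffusers:

```python
from diffusers import PNDMScheduler

noise_scheduler = PNDMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="linear",   # stock SD v1 configs use "scaled_linear" with the same betas
    skip_prk_steps=True,
    set_alpha_to_one=False,
    steps_offset=1,
)
```
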
Cubey42 commented 1 year ago

Is this for training, or for all AnimateDiff work?

Cubey42 commented 1 year ago

@tumurzakov I'm not very familiar with all the modules, but would you happen to know which module handles the style/colors? I'd like to exclude it while training the other modules.

aartykov commented 1 year ago

My observation about the motion module training process so far is that the code perfectly overfits when you train with one video clip. I guess it gives better results if your clip includes a specific motion and is long enough.

aartykov commented 1 year ago

The main drawback is that it also learns the texture, color, and other things...

Cubey42 commented 1 year ago

My observation about the motion module training process so far is that the code perfectly overfits when you train with one video clip. I guess it gives better results if your clip includes a specific motion and is long enough.

Yeah, my best success has been with 2 identical videos. I also feel changing sample_frame_rate has helped with faster motions, but I haven't quite understood the ideal setting; increasing it also speeds up training.

aartykov commented 1 year ago

sample_frame_rate decreases the fps; that is why it helps with faster motion.
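
A back-of-envelope illustration of that effect (assuming a 30 fps source and 16-frame training clips):

```python
source_fps = 30   # assumed source frame rate
clip_frames = 16
for sample_frame_rate in (1, 4, 12):
    seconds = clip_frames * sample_frame_rate / source_fps
    print(f"stride {sample_frame_rate:>2}: the 16 sampled frames span {seconds:.2f}s of real time")
# stride 1 barely covers half a second of a 30 fps source, which is why slow scenes can look
# almost static; larger strides compress more real-world motion into the same 16 frames.
```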

Cubey42 commented 1 year ago

sample_frame_rate decreases the fps; that is why it helps with faster motion.

Do you have a preference in your fine-tuning? I found low values like 1 cause no movement, while somewhere in the 12-15 range seems to give me the most motion.

aartykov commented 1 year ago

Since my video clips are very short, I can only use 1.

Cubey42 commented 1 year ago

And you get decent motion with 1? I don't understand why it feels like I get no movement at 1.

aartykov commented 1 year ago

Try using higher-fps video with this parameter set to 1.

aartykov commented 1 year ago

Guys, do you have any results so far? @tumurzakov @Cubey42

tumurzakov commented 1 year ago

@aartykov I'm fine-tuning for style transfer. For example, I needed a cyberpunk driving video. I fine-tuned with 1000 steps on Manhattan driving videos, and it works as I need. [attached sample frames: 11, 12, 13]

aartykov commented 1 year ago

Looks awesome! How many videos does your dataset include? And how long is each video? @tumurzakov

tumurzakov commented 1 year ago

@aartykov I'm cutting 16-frame videos from a bigger one. I'm using a 1:1 ratio for training steps. It works better if the motions are the same in most of the videos. If I need two motions in a video, I fine-tune one motion and then the other separately. I save checkpoints every 100 steps and after training choose the one that fits my purpose best; often I use the 300-500 step checkpoint from a 1000-step run. Sometimes 1000 steps results in horrible overfitting, other times it works well. I don't know why; maybe if there are more small details, training tends to overfit.

aartykov commented 1 year ago

Got you. Btw, may I add you on LinkedIn?