Why is the performance of my model becoming worse as I continue training?

showlab / Tune-A-Video

[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Apache License 2.0

4.15k stars, 377 forks

At 100 steps the output is fine, but it keeps getting worse as training continues. After reducing the frame width and height, do I need to adjust anything else? My config is:

train_data:
  video_path: "data/man-skiing.mp4"
  prompt: "a man is skiing"
  n_sample_frames: 24
  width: 256
  height: 256
  sample_start_idx: 0
  sample_frame_rate: 2

validation_data:
  prompts:
    - "mickey mouse is skiing on the snow"
    - "spider man is skiing on the beach, cartoon style"
    - "wonder woman, wearing a cowboy hat, is skiing"
    - "a man, wearing pink clothes, is skiing at sunset"
  video_length: 24
  width: 256
  height: 256
  num_inference_steps: 50
  guidance_scale: 12.5
  use_inv_latent: True
  num_inv_steps: 50

learning_rate: 3e-5
train_batch_size: 1
max_train_steps: 500
checkpointing_steps: 1000
validation_steps: 100
trainable_modules:
  - "attn1.to_q"
  - "attn2.to_q"
  - "attn_temp"

seed: 33
mixed_precision: fp16
use_8bit_adam: False
gradient_checkpointing: True
enable_xformers_memory_efficient_attention: False
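Because the samples look fine at step 100 and degrade later, it may help to list where this config would produce validation samples and checkpoints. The following is a minimal sketch, assuming the training script fires each hook whenever the global step is a multiple of the configured interval (an assumption about the trainer, not something confirmed in this thread). `trigger_steps` is a hypothetical helper, not part of Tune-A-Video.

```python
def trigger_steps(max_train_steps, every):
    """Steps in 1..max_train_steps at which an 'every N steps' hook would fire.

    Assumption: the hook fires when step % every == 0.
    """
    if every <= 0:
        raise ValueError("every must be positive")
    return [s for s in range(1, max_train_steps + 1) if s % every == 0]


# Values copied from the config above.
max_train_steps = 500
validation_steps = 100
checkpointing_steps = 1000

validation_at = trigger_steps(max_train_steps, validation_steps)
checkpoints_at = trigger_steps(max_train_steps, checkpointing_steps)

print("validation samples at steps:", validation_at)
print("intermediate checkpoints at steps:", checkpoints_at)
```

Under that assumption, validation samples appear every 100 steps, but no intermediate checkpoint is written because `checkpointing_steps` exceeds `max_train_steps`. If the early samples are the ones worth keeping, lowering `checkpointing_steps` or `max_train_steps` might be worth trying.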

We have also noticed that our models tend to perform worse when lower-resolution videos (e.g., 256 x 256) are used. Our hypothesis is that this is caused by the pretrained SD models, which were trained on higher-resolution images (e.g., 512 x 512). We recommend using higher-resolution videos, such as 384 x 384 or 512 x 512, for better performance.
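The recommendation above amounts to changing `width` and `height` in both the `train_data` and `validation_data` sections. Here is a minimal sketch, assuming the config has been loaded into a plain dictionary with the same keys as the posted YAML. `with_resolution` is a hypothetical helper, and the divisibility check reflects the 8x spatial downscaling of the Stable Diffusion VAE rather than anything stated in this thread.

```python
import copy

VAE_DOWNSCALE = 8  # SD's VAE maps each 8x8 pixel patch to one latent cell


def with_resolution(config, size):
    """Return a copy of config with train and validation frames set to size x size."""
    if size <= 0 or size % VAE_DOWNSCALE != 0:
        raise ValueError(f"size must be a positive multiple of {VAE_DOWNSCALE}")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    for section in ("train_data", "validation_data"):
        updated[section]["width"] = size
        updated[section]["height"] = size
    return updated


# Only the resolution-related keys from the posted config are reproduced here.
config = {
    "train_data": {"width": 256, "height": 256, "n_sample_frames": 24},
    "validation_data": {"width": 256, "height": 256, "video_length": 24},
}

for size in (384, 512):
    new_config = with_resolution(config, size)
    latent = size // VAE_DOWNSCALE
    print(f"{size} x {size} frames -> {latent} x {latent} latents per frame")
```

Training and validation are kept at the same resolution here, as in the posted config. Higher resolutions use more GPU memory, so settings such as `n_sample_frames` may need to be revisited on smaller cards.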



Why is the performance of my model becoming worse as I continue training? #57