showlab / Tune-A-Video

[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
https://tuneavideo.github.io
Apache License 2.0

Why does the sample generated by the accelerate command look much better than the gif generated by the inference script? #60

Closed heptagonnnn closed 1 year ago

heptagonnnn commented 1 year ago

Prompt: "Iron Man is surfing in the desert"

Sample generated by the accelerate command: [gif]

Sample generated by the inference script: [gif]

I use an RTX 3060, torch 1.12, CUDA 11.6, without triton. My train config is as follows:

pretrained_model_path: "./checkpoints/stable-diffusion-v1-4"
output_dir: "./outputs/man-surfing"

train_data:
  video_path: "data/man-surfing.mp4"
  prompt: "a man is surfing"
  n_sample_frames: 8
  width: 512
  height: 512
  sample_start_idx: 0
  sample_frame_rate: 1

validation_data:
  prompts:

learning_rate: 3e-5
train_batch_size: 1
max_train_steps: 500
checkpointing_steps: 1000
validation_steps: 100
trainable_modules:

seed: 33
mixed_precision: fp16
use_8bit_adam: False
gradient_checkpointing: True
enable_xformers_memory_efficient_attention: True

zhangjiewu commented 1 year ago

It could be due to the different sampler used at the inference stage. You may try specifying the DDIM scheduler as follows.

...
import torch
from diffusers import DDIMScheduler
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline

# base Stable Diffusion weights and the fine-tuned Tune-A-Video output
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/man-surfing"

# load the fine-tuned 3D UNet, then build the pipeline with an explicit DDIM scheduler
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
scheduler = DDIMScheduler.from_pretrained(pretrained_model_path, subfolder='scheduler')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, scheduler=scheduler, torch_dtype=torch.float16).to("cuda")

...
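For completeness, here is a minimal sketch of running the rebuilt pipeline and saving a GIF, following the inference usage shown in the repo README; the prompt, frame count, and sampling parameters below are illustrative rather than the exact values from the original report.

from tuneavideo.util import save_videos_grid

prompt = "Iron Man is surfing in the desert"
# 8 frames at 512x512 to mirror the training config above; steps and guidance scale are typical defaults
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=12.5).videos
save_videos_grid(video, f"./{prompt}.gif")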