huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Stable video diffusion #5889

Closed · zhw-zhang closed this issue 8 months ago

zhw-zhang commented 10 months ago

Model/Pipeline/Scheduler description

Hello, yesterday Stability AI open-sourced their image-to-video model (Stable Video Diffusion). When will it be merged into Diffusers, and if possible, could Diffusers also provide the corresponding training code?

Open source status

Provide useful links for the implementation

ref: https://github.com/Stability-AI/generative-models/tree/main

drhead commented 10 months ago

I would like to add that the temporally-aware VAE decoder should be made an easily accessible option for AnimateDiff and related pipelines as part of this. It is fully compatible with SD 1.x and 2.x based models, and from my testing with AnimateDiff it greatly enhances the outputs, although it may be more memory-hungry.

ShashwatNigam99 commented 10 months ago

@drhead do you mean -> (text to image Stable diffusion) + (AnimateDiff) + (the new temporally aware VAE) works pretty well? Could you share more details about your setup?

tin2tin commented 10 months ago

This is the img2vid model the open-source community has been waiting and hoping for since Runway's and Pika's img2vid took off. I've tested SVD and it is very capable: https://youtu.be/aEAy24d8F6E?si=27nOXdxaP29Ncjwn

However, it comes with a requirement of 40 GB of VRAM. There are some optimization tips here: https://twitter.com/timudk/status/1727064128223855087?t=lLeTOO8JYxuEcEiQm7WCWA&s=34 And Camenduru has found that if the NSFW filter is deleted, usage can be brought down to 13.1 GB. But that is still a long way from consumer cards with 6 GB of VRAM. Maybe LCM, pruning, half precision, and the heavy Würstchen compression can help bring it down in size?
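As a reference point, here is a minimal sketch of the memory-saving options exposed by the StableVideoDiffusionPipeline that later landed in diffusers (see the PR linked further down). Half-precision weights, model CPU offload, and chunked frame decoding are the main levers; the exact decode_chunk_size value is only an illustrative choice.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the pipeline in half precision to roughly halve weight memory.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
# Keep only the currently active sub-model on the GPU.
pipe.enable_model_cpu_offload()

image = load_image("input.png").resize((1024, 576))

# decode_chunk_size limits how many frames the VAE decodes at once,
# trading speed for lower peak VRAM.
frames = pipe(image, decode_chunk_size=2).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```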

charchit7 commented 10 months ago

Looks nice! @tin2tin

tumurzakov commented 10 months ago

> @drhead do you mean -> (text to image Stable diffusion) + (AnimateDiff) + (the new temporally aware VAE) works pretty well? Could you share more details about your setup?

decoder.yaml

target: sgm.modules.autoencoding.temporal_ae.VideoDecoder
params:
  attn_type: vanilla-xformers
  double_z: True
  z_channels: 4
  resolution: 256   
  in_channels: 3  
  out_ch: 3
  ch: 128 
  ch_mult: [1, 2, 4, 4]
  num_res_blocks: 2
  attn_resolutions: []
  dropout: 0.0 
  video_kernel_size: [3, 1, 1]

load

from sgm.util import instantiate_from_config
from omegaconf import OmegaConf
import torch
import gc
from safetensors.torch import load_file as load_safetensors

# Load the SVD checkpoint and keep only the temporal VAE decoder weights.
sd = load_safetensors(f'{models}/stable_video/svd_xt.safetensors')
prefix = 'first_stage_model.decoder.'
weights = {}
for key in sd.keys():
    if key.startswith(prefix):
        weights[key.replace(prefix, '')] = sd[key]
del sd
gc.collect()
torch.cuda.empty_cache()

# Instantiate the VideoDecoder from the config above and load the extracted weights.
config = OmegaConf.load("../repos/GenerativeModels/scripts/sampling/configs/decoder.yaml")
decoder = instantiate_from_config(config)
missing, unexpected = decoder.load_state_dict(weights, strict=False)
print("missing:", len(missing), "unexpected:", len(unexpected))
decoder = decoder.eval().to('cuda', dtype=torch.float16)

infer

import math
import torch
from einops import rearrange
from tqdm import tqdm

# `pipe`, `kwargs`, `decoder`, and the `media` display helper are defined elsewhere;
# the pipeline is asked for latents instead of decoded frames.
latents = pipe(output_type='latents', **kwargs)

video_length = latents.shape[2]
# Fold the frame dimension into the batch dimension: the temporal decoder expects (b*f, c, h, w).
z = rearrange(latents, 'b c f h w -> (b f) c h w')

n_samples = 12                                   # frames decoded per chunk (limits peak VRAM)
n_rounds = math.ceil(z.shape[0] / n_samples)
scale_factor = 0.18215                           # SD latent scaling factor
z = 1.0 / scale_factor * z

all_out = []
with torch.autocast("cuda", dtype=torch.float16):
    for n in tqdm(range(n_rounds)):
        chunk = z[n * n_samples : (n + 1) * n_samples]
        # `timesteps` tells the temporal decoder how many frames are in this chunk.
        out = decoder(chunk, timesteps=len(chunk))
        all_out.append(out)

out = torch.cat(all_out, dim=0)
out = rearrange(out, '(b f) c h w -> b c f h w', f=video_length)
out = (out / 2 + 0.5).clamp(0, 1)
out = out.detach().cpu().float()

media.show_video(out, fps=8)

This gives much clearer results than the ordinary VAE.
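For comparison, once the SVD components were merged into diffusers, the same idea can be sketched without the sgm repo by loading the temporal decoder as AutoencoderKLTemporalDecoder. This is only a rough sketch: the dummy latent shape below stands in for the output of an AnimateDiff-style run with output_type="latent" and may need adjusting.

```python
import torch
from diffusers import AutoencoderKLTemporalDecoder

# Temporal VAE decoder shipped with the SVD weights (diffusers >= 0.24).
vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

# `latents` would come from an AnimateDiff-style pipeline run with output_type="latent"
# (shape: batch, channels, frames, height/8, width/8); a random tensor stands in here.
latents = torch.randn(1, 4, 16, 64, 64)

# Fold frames into the batch dimension and undo the latent scaling.
b, c, f, h, w = latents.shape
z = latents.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
z = z.to("cuda", dtype=torch.float16) / vae.config.scaling_factor

with torch.no_grad():
    # The temporal decoder needs to know how many frames belong to one video.
    frames = vae.decode(z, num_frames=f).sample   # (b*f, 3, H, W) in [-1, 1]
frames = (frames / 2 + 0.5).clamp(0, 1).cpu().float()
```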

painebenjamin commented 10 months ago

Just sharing useful resources.

Comfy released their built-in support for SVD just now, with a minimum VRAM requirement of a mere 8GB for 25 frames at 1024x576.

Commit adding support to infrastructure: https://github.com/comfyanonymous/ComfyUI/commit/871cc20e13e9ef2629e3b5faa6af64207e86d6d2

Commit adding nodes: https://github.com/comfyanonymous/ComfyUI/commit/42dfae63312f443d13841a0c4a5de467f5c354c9

tin2tin commented 10 months ago

https://github.com/huggingface/diffusers/pull/5895

ghost commented 10 months ago

Any chance of a training script?

patrickvonplaten commented 10 months ago

Maybe best to ask directly on https://github.com/Stability-AI/generative-models

ghost commented 10 months ago

This is something they don't have. Is it possible to put together something similar from the existing Diffusers library?

patrickvonplaten commented 10 months ago

Sure, we'd more than welcome such a training script if the community is interested in creating one.

antonioo-c commented 10 months ago

@patrickvonplaten Hi Patrick, I wonder if the diffusers team will work on the training code for Stable Video Diffusion pipeline? Thank you.

patrickvonplaten commented 10 months ago

We haven't planned anything yet, but we'd be more than happy to sponsor a community effort here

pixeli99 commented 9 months ago

I'm quite happy to implement training code, but what I'm unsure about is the usage of the new noise scheduler in SVD. I don't have much experience with this; does anyone have suggestions for resources I could refer to?

ghost commented 9 months ago

It's just from the same paper as k-diffusion (Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models", i.e. EDM).
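Roughly, that means EDM-style training: sample a noise level from a log-normal distribution, precondition the network inputs/outputs with c_skip/c_out/c_in, and weight the denoising loss accordingly. Below is a minimal sketch of that loss using the EDM paper's defaults for sigma_data and the log-normal parameters; SVD's exact fine-tuning settings may differ, and `model` is a placeholder for a network taking (noisy input, noise conditioning).

```python
import torch

def edm_loss(model, x0, sigma_data=0.5, P_mean=-1.2, P_std=1.2):
    """EDM-style denoising loss (Karras et al. 2022).
    P_mean/P_std are the EDM defaults; SVD's choices may differ."""
    b = x0.shape[0]
    # Sample per-example noise levels from a log-normal distribution.
    sigma = (torch.randn(b, device=x0.device) * P_std + P_mean).exp()
    sigma = sigma.view(b, *([1] * (x0.ndim - 1)))

    # Preconditioning coefficients from the EDM paper.
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2

    # Add noise, then predict the clean sample through the wrapper D(x; sigma).
    noised = x0 + torch.randn_like(x0) * sigma
    F = model(c_in * noised, c_noise.flatten())   # raw network output
    denoised = c_skip * noised + c_out * F
    return (weight * (denoised - x0) ** 2).mean()
```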

pixeli99 commented 9 months ago

https://github.com/pixeli99/SVD_Xtend

I hope this will be helpful to those looking to fine-tune SVD. Please be aware that this is a setup from a beginner and there may be some hidden errors, so use it with discretion.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.