I would like to add that the temporally-aware VAE decoder should be made into an easily accessible option for AnimateDiff and related pipelines as part of this. It is fully compatible with SD1.x and 2.x based models, and from my testing on AnimateDiff it greatly enhances the outputs, although it may be more memory hungry.
@drhead do you mean -> (text to image Stable diffusion) + (AnimateDiff) + (the new temporally aware VAE) works pretty well? Could you share more details about your setup?
This is the img2vid model the open source community has been waiting and hoping for since Runway's and Pika's img2vid took off. I've tested SVD and it is very capable: https://youtu.be/aEAy24d8F6E?si=27nOXdxaP29Ncjwn
However, it comes with a requirement of 40 GB VRAM. There are some optimization tips here: https://twitter.com/timudk/status/1727064128223855087?t=lLeTOO8JYxuEcEiQm7WCWA&s=34 And Camenduru has found that deleting the NSFW filter brings it down to 13.1 GB. But that is still a long way from consumer cards with 6 GB of VRAM. Maybe LCM, pruning, half precision, and the heavy Würstchen compression can help bring it down in size?
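For reference, here is a minimal sketch of what those memory optimizations look like with the diffusers StableVideoDiffusionPipeline (assuming that pipeline and the stabilityai/stable-video-diffusion-img2vid-xt weights are available): fp16 weights, CPU offloading of the sub-models, and chunked VAE decoding are the main levers.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# fp16 weights roughly halve the memory footprint
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
# keep only the currently active sub-model on the GPU
pipe.enable_model_cpu_offload()

image = load_image("input.png").resize((1024, 576))
# decode_chunk_size limits how many frames the VAE decodes at once,
# trading speed for a much smaller peak VRAM usage
frames = pipe(image, decode_chunk_size=2).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```

decode_chunk_size is the biggest knob for peak VRAM during decoding, at some cost in speed and possibly temporal consistency between chunks.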
Looks nice! @tin2tin
decoder.yaml

```yaml
target: sgm.modules.autoencoding.temporal_ae.VideoDecoder
params:
  attn_type: vanilla-xformers
  double_z: True
  z_channels: 4
  resolution: 256
  in_channels: 3
  out_ch: 3
  ch: 128
  ch_mult: [1, 2, 4, 4]
  num_res_blocks: 2
  attn_resolutions: []
  dropout: 0.0
  video_kernel_size: [3, 1, 1]
```
load

```python
import gc
import torch
from omegaconf import OmegaConf
from safetensors.torch import load_file as load_safetensors
from sgm.util import instantiate_from_config

# pull only the temporal decoder weights out of the SVD checkpoint
sd = load_safetensors(f'{models}/stable_video/svd_xt.safetensors')  # `models` is the local checkpoint directory
prefix = 'first_stage_model.decoder.'
weights = {}
for key in sd.keys():
    if prefix in key:
        weights[key.replace(prefix, '')] = sd[key]
del sd
gc.collect()
torch.cuda.empty_cache()

# build the VideoDecoder from the config above and load the extracted weights
config = OmegaConf.load("../repos/GenerativeModels/scripts/sampling/configs/decoder.yaml")
decoder = instantiate_from_config(config)
m, e = decoder.load_state_dict(weights, strict=False)
print("missing:", len(m), "unexpected:", len(e))
decoder = decoder.eval().to('cuda', dtype=torch.float16)
```
infer

```python
import math
import mediapy as media
import torch
from einops import rearrange
from tqdm import tqdm

# run the AnimateDiff pipeline but keep the latents instead of decoding them with the standard VAE
latents = pipe(output_type='latents', **kwargs)
video_length = latents.shape[2]

# fold the frame dimension into the batch dimension and un-scale the latents
z = rearrange(latents, 'b c f h w -> (b f) c h w')
n_samples = 12  # frames decoded per chunk; lower this if you run out of VRAM
n_rounds = math.ceil(z.shape[0] / n_samples)
scale_factor = 0.18215
z = 1.0 / scale_factor * z

# decode in chunks with the temporal VideoDecoder
all_out = []
with torch.autocast("cuda", dtype=torch.float16):
    for n in tqdm(range(n_rounds)):
        chunk = z[n * n_samples : (n + 1) * n_samples]
        out = decoder(chunk, timesteps=len(chunk))
        all_out.append(out)

out = torch.cat(all_out, dim=0)
out = rearrange(out, '(b f) c h w -> b c f h w', f=video_length)
out = (out / 2 + 0.5).clamp(0, 1)
out = out.detach().cpu().float()
# mediapy expects (frames, H, W, C)
media.show_video(out[0].permute(1, 2, 3, 0).numpy(), fps=8)
```
This gives much clearer results than the ordinary VAE decoder.
Just sharing useful resources.
Comfy released their built-in support for SVD just now, with a minimum VRAM requirement of a mere 8GB for 25 frames at 1024x576.
Commit adding support to infrastructure: https://github.com/comfyanonymous/ComfyUI/commit/871cc20e13e9ef2629e3b5faa6af64207e86d6d2
Commit adding nodes: https://github.com/comfyanonymous/ComfyUI/commit/42dfae63312f443d13841a0c4a5de467f5c354c9
Any chance of a training script?
Maybe best to ask directly on https://github.com/Stability-AI/generative-models
This is something they don't have. Is it possible to put together something similar from the existing diffusers library?
Sure, we'd more than welcome such a training script if the community is interested in creating one.
@patrickvonplaten Hi Patrick, I wonder if the diffusers team will work on the training code for the Stable Video Diffusion pipeline? Thank you.
We haven't planned anything yet, but we'd be more than happy to sponsor a community effort here
I'm quite happy to implement the training code, but what I'm unsure about is the new noise scheduler used in SVD. I don't have much experience with this; does anyone have suggestions for resources I could refer to?
It's just the EDM formulation from the same paper that k-diffusion is based on (Karras et al., 2022).
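For anyone picking this up, here is a minimal sketch of the EDM-style ("Karras") noise sampling and preconditioning from that paper, which k-diffusion implements. The P_mean / P_std / sigma_data values below are the EDM paper defaults rather than whatever SVD was actually trained with, and model / cond are placeholders for the UNet and its conditioning.

```python
import torch

# EDM paper defaults -- SVD very likely uses different values
P_mean, P_std, sigma_data = -1.2, 1.2, 0.5

def sample_sigmas(batch_size, device="cuda"):
    # log-normal noise level distribution: ln(sigma) ~ N(P_mean, P_std^2)
    return (torch.randn(batch_size, device=device) * P_std + P_mean).exp()

def edm_loss(model, x0, cond):
    sigmas = sample_sigmas(x0.shape[0], device=x0.device)
    sigma = sigmas.view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_noisy = x0 + noise * sigma

    # EDM preconditioning coefficients
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4

    # the network sees a scaled input and a log-sigma "timestep";
    # the denoised prediction is a skip connection plus a scaled network output
    pred = model(c_in * x_noisy, c_noise.flatten(), cond)
    denoised = c_skip * x_noisy + c_out * pred

    # sigma-dependent loss weighting from the EDM paper
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    return (weight * (denoised - x0) ** 2).mean()
```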
https://github.com/pixeli99/SVD_Xtend
I hope this will be helpful to those looking to fine-tune SVD. Please be aware that this is a beginner's setup and there may be hidden errors, so use it with caution.
Model/Pipeline/Scheduler description
Hello, yesterday Stability AI open-sourced their image-to-video model, Stable Video Diffusion. When will it be merged into Diffusers, and if possible, can Diffusers also provide the corresponding training code?
Open source status
Provide useful links for the implementation
ref: https://github.com/Stability-AI/generative-models/tree/main