kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: Vid2Vid no longer creates videos of the correct frame count #116

Closed B34STW4RS closed 1 year ago

B34STW4RS commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

Mid-post edit: ugh... it looks like there is a mistake in the UI: changing the frame count on the text2video tab affects the frame count on the vid2vid tab instead, and the slider on the vid2vid tab does nothing.

Updated to the newest build.

Processed a 62-frame video...

Output was a 24-frame video...

Steps to reproduce the problem

Tried with multiple videos, same result every time.

The output is always 24 frames.

What should have happened?

On the previous build, I would receive an output of the exact length of the input.

ex: 125 frames in / 125 frames out; 60 frames in / 60 frames out; etc.
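The expected behavior described above (output length matches input length, with the extracted frame count as the natural fallback) could be sketched as a small helper. This is purely illustrative; `resolve_frame_count` is a hypothetical name, not a function from the extension:

```python
def resolve_frame_count(requested, extracted):
    """Hypothetical sketch of the expected vid2vid behavior:
    default to the number of frames extracted from the input video,
    and never produce more frames than the input provides."""
    if requested is None or requested <= 0:
        # No explicit override: mirror the input video's length
        return extracted
    # Explicit override: cap at the extracted frame count
    return min(requested, extracted)
```

Under this behavior, a 62-frame input with no override yields a 62-frame output rather than the hard-coded 24 observed in the bug.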

WebUI and Deforum extension Commit IDs

webui commit id - [22bcc7be] txt2vid commit id -67aaba9f (Sat Apr 15 23:35:17 2023)

What GPU were you using for launching?

4090 oc 24 gb

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

Not relevant; the same in all configurations.

Console logs

text2video — The model selected is:  ModelScope
 text2video extension for auto1111 webui
Git commit: 67aaba9f (Sat Apr 15 23:35:17 2023)
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
got a request to *vid2vid* an existing video.
Trying to extract frames from video with input FPS of 20.0. Please wait patiently.
Successfully extracted 62.0 frames from video.
Loading frames: 100%|██████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 57.20it/s]
Converted the frames to tensor (1, 25, 3, 256, 256)
Computing latents
STARTING VAE ON GPU
VAE HALVED
Working in vid2vid mode
  0%|                                                                                    | 0/1 [00:00<?, ?it/s]latents torch.Size([1, 4, 25, 32, 32]) tensor(-0.0405, device='cuda:0') tensor(0.9209, device='cuda:0')
huh tensor(793) tensor([793], device='cuda:0')
DDIM sampling tensor(1): 100%|█████████████████████████████████████████████████| 24/24 [00:06<00:00,  3.62it/s]
STARTING VAE ON GPU. 13 CHUNKS TO PROCESS██████████████████████████████████████| 24/24 [00:06<00:00,  3.69it/s]
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([25, 3, 256, 256])
output/mp4s/20230416_044844570094.mp4
text2video finished, saving frames to D:\NasD\stable-diffusion-webui\outputs/img2img-images\text2video\20230416044810
Got a request to stitch frames to video using FFmpeg.
Frames:
D:\NasD\stable-diffusion-webui\outputs/img2img-images\text2video\20230416044810\%06d.png
To Video:
D:\NasD\stable-diffusion-webui\outputs/img2img-images\text2video\20230416044810\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.76 seconds!
t2v complete, result saved at D:\NasD\stable-diffusion-webui\outputs/img2img-images\text2video\20230416044810

Additional information

The problem seems to be that the frame count generation parameter is not being passed to the pipeline when hitting Generate, so the extension always generates only 24 frames in vid2vid mode.

B34STW4RS commented 1 year ago

After some further checks, it looks like most of the vid2vid sliders are broken and are controlled by the text2video tab instead...
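The failure mode described in this thread can be sketched without any UI framework: if the generate handler is wired to read the text2video tab's slider rather than the vid2vid tab's, the vid2vid slider has no effect. All names and values here are illustrative, not the extension's actual code:

```python
class Slider:
    """Minimal stand-in for a UI slider component."""
    def __init__(self, value):
        self.value = value

# Two tabs each have their own frame-count slider (hypothetical defaults)
txt2vid_frames = Slider(24)   # default on the text2video tab
vid2vid_frames = Slider(62)   # what the user actually set

def generate(frames_component):
    """The handler only ever sees whichever component it was wired to."""
    return frames_component.value

# Buggy wiring: the vid2vid generate button reads the txt2vid slider,
# so the run always uses 24 frames regardless of the vid2vid setting.
buggy_result = generate(txt2vid_frames)

# Correct wiring would pass the vid2vid tab's own slider instead.
fixed_result = generate(vid2vid_frames)
```

In Gradio-style UIs this corresponds to listing the wrong component in an event handler's `inputs`, which matches the observation that the text2video sliders control the vid2vid run.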

kabachuha commented 1 year ago

Thanks! I'll look into that. It seems to be related to how the code was reworked to enable the WebAPI.

kabachuha commented 1 year ago

Found the issue.