kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Bug]: On generation, it samples the tensors twice and vid2vid doesn't work. #62

Closed: Grendar1 closed this issue 1 year ago

Grendar1 commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

Every time I click "Generate", it samples the tensors twice instead of only once, which means I have to wait double the time for a single video. I understand vid2vid was just added, but even so, when I tried a vid2vid generation, the output came only from the text2video tab, even though it sampled the tensors twice.

Steps to reproduce the problem

For text2video:

  1. Go to ModelScope text2video.
  2. Add a prompt, for example "sunrise from tokyo, by makoto shinkai".
  3. Click the yellow "Generate" button.
  4. Wait twice the usual time.

For vid2vid:

  5. Add a video that was generated from text2video.
  6. Add a prompt, for example "a boy with sunglasses".
  7. Click "Generate" and again wait twice the time, because it samples the tensors twice.
  8. Find that vid2vid doesn't generate from the vid2vid tab but from text2video (which I left blank, and it outputs something unrelated, like a tortoise underwater).

What should have happened?

It should sample the tensors only once when using only text2video. It should sample the tensors twice if I add prompts for both text2video and vid2vid. It should sample the tensors once if only vid2vid is used.

WebUI and Deforum extension Commit IDs

webui commit id - a9fed7c3 | txt2vid commit id - https://github.com/deforum-art/sd-webui-modelscope-text2video.git | 84020058 (Fri Mar 24 14:49:52 2023)

What GPU were you using for launching?

RTX 3060 12GB VRAM, 16 GB Ram.

On which platform are you launching the webui backend with the extension?

Local PC setup (Windows)

Settings

--xformers --no-half-vae --api. I didn't change anything; I just added the prompts. Everything is at its defaults, with fp16 enabled for the GPU.
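
For reference, on a local Windows install these flags normally live in webui-user.bat. A minimal sketch of that file with the reported flags (the other variables are left at their defaults and are an assumption, not taken from this report):

@echo off
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers --no-half-vae --api
call webui.bat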

Console logs

ModelScope text2video extension for auto1111 webui
Git commit: 84020058 (Fri Mar 24 14:49:52 2023)
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
latents torch.Size([1, 4, 24, 32, 32]) tensor(-0.0012, device='cuda:0') tensor(1.0001, device='cuda:0')
DDIM sampling tensor(1): 100%|███████████████████████████████████████| 31/31 [00:41<00:00,  1.33s/it]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 256])
output/mp4s/20230324_215112403414.mp4
  0%|                                                                          | 0/1 [00:00<?, ?it/s]
latents torch.Size([1, 4, 24, 32, 32]) tensor(-0.0007, device='cuda:0') tensor(1.0037, device='cuda:0')
DDIM sampling tensor(1): 100%|███████████████████████████████████████| 31/31 [00:41<00:00,  1.34s/it]
STARTING VAE ON GPU. 24 CHUNKS TO PROCESS
VAE HALVED
DECODING FRAMES
VAE FINISHED
torch.Size([24, 3, 256, 256])
output/mp4s/20230324_215201616361.mp4
text2video finished, saving frames to C:\Stable Diffusion\stable-diffusion-webui\outputs/img2img-images\text2video-modelscope\20230324215000
Got a request to stitch frames to video using FFmpeg.
Frames:
C:\Stable Diffusion\stable-diffusion-webui\outputs/img2img-images\text2video-modelscope\20230324215000\%06d.png
To Video:
C:\Stable Diffusion\stable-diffusion-webui\outputs/img2img-images\text2video-modelscope\20230324215000\vid.mp4
Stitching *video*...
Stitching *video*...
Video stitching done in 0.26 seconds!
t2v complete, result saved at C:\Stable Diffusion\stable-diffusion-webui\outputs/img2img-images\text2video-modelscope\20230324215000

Additional information

No response

toyxyz commented 1 year ago

Same here

kabachuha commented 1 year ago

Have you played around with Denoising strength?

If it's at 1, it means full change
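
As a rough illustration of that point, here is a sketch of how denoising strength is commonly handled in img2img-style pipelines (illustrative only, not this extension's actual code): the strength decides how much of the sampling schedule is applied to the re-noised input, so at 1.0 the source video no longer constrains the result.

# Illustrative only: typical img2img-style behaviour, not this extension's code.
def effective_steps(total_steps: int, strength: float) -> int:
    # strength 0.0 leaves the input untouched, strength 1.0 regenerates it fully
    strength = max(0.0, min(1.0, strength))
    return int(round(total_steps * strength))

for s in (0.0, 0.4, 1.0):
    print(f"strength={s}: {effective_steps(31, s)} of 31 DDIM steps")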

jav12z commented 1 year ago

> Have you played around with Denoising strength?
>
> If it's at 1, it means full change

It doesn't work; even at Denoising strength 0 it gives a video not related to the sample.

Compviztr commented 1 year ago

Same. Vid2vid uses the txt2vid input and doesn’t appear to use the uploaded video.

hithereai commented 1 year ago

I can confirm that vid2vid doesn't work, guys. Please revert to an older version or wait for our fix, but it might take 24 hours or more, since it's the weekend and we need some time off <3

Watch out for updates anyways. And thanks for providing feedback!

toyxyz commented 1 year ago

When I changed it to the Mar 23, 2023 commit, vid2vid works, so it seems an update after that is the cause of the problem: 1b0385a707195ea785f99313031699f2f9c86e27
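
For anyone who wants to try that workaround, a hedged example of pinning the extension to that commit, run from the webui root folder (Windows paths; the folder name assumes the default clone name of the extension and may differ on your install):

cd extensions\sd-webui-modelscope-text2video
git checkout 1b0385a707195ea785f99313031699f2f9c86e27

Later, to return to the branch you were on and pull the fix once it lands:

git checkout -
git pull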