kabachuha / sd-webui-text2video

Auto1111 extension implementing text2video diffusion models (like ModelScope or VideoCrafter) using only Auto1111 webui dependencies

[Needs help]: VideoCrafter bugs - only generating 1 second clip and adds audio regardless of setting #97

Closed Scruntee closed 1 year ago

Scruntee commented 1 year ago

Is there an existing issue for this?

Are you using the latest version of the extension?

What happened?

Regardless of the number of frames and the video frame rate, generating a video produces a 1-second clip with audio, even with the audio option ticked off. Also, only the video file is saved, not the individual frames, unlike previous ModelScope generations, which saved the frames too.

Steps to reproduce the problem

  1. Go to ....
  2. Press ....
  3. ...

What should have happened?

No response

WebUI and Deforum extension Commit IDs

webui commit id - 226d840e
txt2vid commit id -

What GPU were you using for launching?

2080 super

On which platform are you launching the webui backend with the extension?

No response

Settings

(settings screenshot attached)

Console logs

To create a public link, set `share=True` in `launch()`.
text2video — The model selected is:  VideoCrafter
 text2video extension for auto1111 webui
Git commit: Unknown
VideoCrafter config:
 {'model': {'target': 'lvdm.models.ddpm3d.LatentDiffusion', 'params': {'linear_start': 0.00085, 'linear_end': 0.012, 'num_timesteps_cond': 1, 'log_every_t': 200, 'timesteps': 1000, 'first_stage_key': 'video', 'cond_stage_key': 'caption', 'image_size': [32, 32], 'video_length': 16, 'channels': 4, 'cond_stage_trainable': False, 'conditioning_key': 'crossattn', 'scale_by_std': False, 'scale_factor': 0.18215, 'unet_config': {'target': 'lvdm.models.modules.openaimodel3d.UNetModel', 'params': {'image_size': 32, 'in_channels': 4, 'out_channels': 4, 'model_channels': 320, 'attention_resolutions': [4, 2, 1], 'num_res_blocks': 2, 'channel_mult': [1, 2, 4, 4], 'num_heads': 8, 'transformer_depth': 1, 'context_dim': 768, 'use_checkpoint': True, 'legacy': False, 'kernel_size_t': 1, 'padding_t': 0, 'temporal_length': 16, 'use_relative_position': True}}, 'first_stage_config': {'target': 'lvdm.models.autoencoder.AutoencoderKL', 'params': {'embed_dim': 4, 'monitor': 'val/rec_loss', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}, 'lossconfig': {'target': 'torch.nn.Identity'}}}, 'cond_stage_config': {'target': 'lvdm.models.modules.condition_modules.FrozenCLIPEmbedder'}}}}
Loading model from C:\Users\Conner\OneDrive\stable-diffusion-webui\models/VideoCrafter/model.ckpt
LatentDiffusion: Running in eps-prediction mode
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.
Successfully initialize the diffusion model !
DiffusionWrapper has 958.92 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Sampling Batches (text-to-video): 100%|██████████████████████████████████████████████████| 1/1 [00:36<00:00, 36.57s/it]
text2video finished, saving frames to C:\Users\Conner\OneDrive\stable-diffusion-webui\outputs/img2img-images\text2video\20230405191018
Adding empty frames: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1998.24it/s]
Making grids: 100%|█████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 32002.32it/s]
Adding soundtrack to *video*...                                                                 | 0/16 [00:00<?, ?it/s]
FFmpeg Audio stitching done in 1.05 seconds!
t2v complete, result saved at C:\Users\Conner\OneDrive\stable-diffusion-webui\outputs/img2img-images\text2video\20230405191018
Finish sampling!
Run time = 39.16 seconds

Additional information

No response

Jonseed commented 1 year ago

I'm also seeing the same issues: it adds audio even when "add soundtrack" is set to "none," and there is only the video file, with no individual frames saved.

Jonseed commented 1 year ago

I only get 16 frames, no matter how many I specify.

rookiemann commented 1 year ago

Same issues. I have a post in 'Discussions' where I'm having problems with Docker, so I installed everything the regular way. I tried VideoCrafter and got the soundtrack added when unwanted, and it generates just one second of frames.

justinwking commented 1 year ago

I am having the same results as Jonseed and rookiemann. I am using Automatic1111 on Windows 10 with a 3090 external GPU.

BillarySquintin commented 1 year ago

To fix the audio being added, just comment out this line in text2vid.py, in the `process_videocrafter` function:

    add_soundtrack(ffmpeg_location, fps, os.path.join(outdir_current, f"vid.mp4"), 0, -1, None, add_soundtrack, soundtrack_path, ffmpeg_crf, ffmpeg_preset)
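If you would rather keep the feature than delete the call outright, here is a minimal sketch of a guard. Note that the quoted call passes a variable also named `add_soundtrack` as the seventh argument, which may be the shadowing that breaks the "None" check; `soundtrack_mode` below is a hypothetical stand-in for whatever variable actually holds the UI dropdown value, not the extension's confirmed API:

    # Hypothetical guard instead of deleting the call: only stitch audio when
    # the UI dropdown is not "None". `soundtrack_mode` is a stand-in name for
    # the dropdown value; the rest mirrors the quoted call.
    if soundtrack_mode != 'None':
        add_soundtrack(ffmpeg_location, fps, os.path.join(outdir_current, "vid.mp4"),
                       0, -1, None, soundtrack_mode, soundtrack_path,
                       ffmpeg_crf, ffmpeg_preset)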

justinwking commented 1 year ago

Thank you BillarySquintin, that took care of the secondary issue. Do you have any thoughts on what we might change to extend the frame limit beyond 16 frames, and hopefully get access to the image sequence as well? Thank you again.

rookiemann commented 1 year ago

I've updated to the new version and still have the problem with the soundtrack being added; I had to go into the text2vid.py file and comment out the if statement at line 399 completely. I also still get only about a second of frames, and the output directory has only the video, not the images. The one-second limit might be something in the VideoCrafter scripts, I think, no?

B34STW4RS commented 1 year ago

After a quick glance before bed, I'm struggling to find where, if at all, the number of frames is explicitly being requested from VideoCrafter... but it definitely doesn't look like the tensor size is being set up outside its default configuration... will look into it more later...

Edit: it looks like no args from the UI are being passed to VideoCrafter besides prompt and cfg.

Maybe steps works as well, possibly eta, not sure (the output is too garbage to really tell if anything changes).

jojicelf commented 1 year ago

I found that the resolution is also not the one you input: all the videos come out at 32, even though the minimum in the UI is 64. The file everything is read from is stable-diffusion-webui\extensions\sd-webui-text2video\scripts\videocrafter\base_t2v.

From lines 11 to 13 you can set the video resolution:

    image_size:
    - 64
    - 64

I changed it to 64x64 as above, and it worked.

Line 9 reads `video_length: 16`.

I changed it to 24, 32, and 64, and it doesn't work; it keeps generating 16 frames.

And line 45 has `temporal_length: 16`.

I also changed that one, but it creates errors:

    size mismatch for model.diffusion_model.output_blocks.11.1.transformer_blocks.0.attn2_tmp.relative_position_v.embeddings_table: copying a param with shape torch.Size([xx, xx]) from checkpoint, the shape in current model is torch.Size([xx, xx]).

(The checkpoint's relative-position embedding tables are sized to `temporal_length`, so the saved weights no longer fit the model once it changes.)

You can put 17 or any other number and it just dies.

I tried modifying only line 9, only line 45, and both at the same time; none of it works.
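For reference, here is how those settings nest, reconstructed from the VideoCrafter config dump in the console log above; this is a sketch rather than a verbatim copy of base_t2v, so treat line numbers and layout as approximate:

    # Sketch reconstructed from the logged config, not a verbatim base_t2v copy.
    model:
      params:
        image_size: [32, 32]      # the "lines 11-13" resolution setting
        video_length: 16          # the "line 9" frame count
        unet_config:
          params:
            temporal_length: 16   # the "line 45" setting; the checkpoint's
                                  # relative-position tables are sized to this,
                                  # hence the size-mismatch error when it changes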

B34STW4RS commented 1 year ago

We're also behind in our script, as apparently ours doesn't support variable video length? Idk what they meant by that in the original commit, but their sample_text2video.py has, at line 41:

    parser.add_argument("--num_frames", type=int, default=16, help="number of input frames")

I tried working through it but got stuck on `noise_shape = make_model_input_shape(model, batch_size, T=num_frames)` raising an undefined-name error, though it is defined elsewhere and imported, dunno...
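For what it's worth, here is a hypothetical reconstruction of what a helper like that presumably returns, going by the logged config (`channels: 4`, `image_size: [32, 32]`); this is an assumption, not the repo's actual implementation:

    # Hypothetical, NOT the repo's code: assumes the usual
    # (batch, channels, frames, height, width) latent layout.
    def make_model_input_shape(model, batch_size, T=16):
        return [batch_size, model.channels, T,
                model.image_size[0], model.image_size[1]]

    # With the logged config and batch_size=1, T=16 -> [1, 4, 16, 32, 32],
    # consistent with the logged per-frame z shape (1, 4, 32, 32).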

pmonck commented 1 year ago

I managed to get the frames setting to work by updating to the latest scripts from the VideoCrafter repo and then editing process_videocrafter.py as follows, from line 62.

    samples = sample_text2video(model, args.prompt, 1, 1,
                    sample_type="ddim", sampler=ddim_sampler,
                    ddim_steps=args.steps, eta=args.eta, 
                    cfg_scale=args.cfg_scale,
                    decode_frame_bs=1,
                    ddp=False, show_denoising_progress=False,
                    num_frames=args.frames,
                    )

Sorry, I have no GitHub skills, so I don't know how to mark this up properly.

After adding num_frames=args.frames, the slider works for setting the number of frames.

I also had to remove the args.n_prompt argument. The new sample_text2video.py script from VideoCrafter doesn't seem to have that argument (maybe I'm missing something!)

BTW, I haven't tried adding num_frames=args.frames to the original process_videocrafter.py script from this repo - maybe that works too!

B34STW4RS commented 1 year ago

This is about as far as I got this morning:

File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\t2v_helpers\render.py", line 26, in run vids_pack = process_videocrafter(args_dict) File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\videocrafter\process_videocrafter.py", line 61, in process_videocrafter samples = sample_text2video(model, args.prompt, 1, 1,# todo:add batch size support File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) TypeError: sample_text2video() got multiple values for argument 'sample_type' Exception occurred: sample_text2video() got multiple values for argument 'sample_type'

Changing line 62 in the included script throws an unexpected keyword argument 'num_frames'.
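That "multiple values" TypeError is Python's generic complaint when a positional argument has already filled a parameter that is then passed again by keyword; a standalone toy illustration (nothing below is from the extension or VideoCrafter):

    # Toy example of the TypeError above; f() is a throwaway function.
    def f(a, b, sample_type="ddim"):
        return sample_type

    f(1, 2, "ddpm", sample_type="ddim")
    # TypeError: f() got multiple values for argument 'sample_type'

Presumably the updated upstream signature moved or dropped parameters ahead of sample_type, so one of the old call's positional arguments now lands on it.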

Never mind, I'm an idiot: replace the LVDM with the new LVDM, replace the scripts with the new scripts, and fix the imports in sample_text2video.py at line 12 (just copy the old ones over). Frames generation is fixed; major props to pmonck.

https://user-images.githubusercontent.com/11381013/233272918-e200894e-b994-49a2-a399-10575d1adbf8.mp4

However... the temporal consistency seems to be all over the place? Sort of like a slideshow: no matter the number of frames or the fps I generate, something feels very 'off' about the output, and it completely seems to lack the smooth motion of the basic ModelScope model.

kabachuha commented 1 year ago

So you updated the base files from their repo, right?

pmonck commented 1 year ago

Yes, I grabbed the updated files from the VideoCrafter repo and made edits. It looks like that is going to be a necessary step in getting control over the number of frames (as well as all the other new features).

B34STW4RS commented 1 year ago

https://user-images.githubusercontent.com/11381013/233299151-64c48628-af02-4c52-9522-90e9595f54a6.mp4

Something still feels really off about it. I've been trying to mess with the configs before work, but I'm not making much headway.

Compared to original modelscope:

https://user-images.githubusercontent.com/11381013/233299805-a48cde47-7512-4a30-94be-83d58e1eabc5.mp4

pmonck commented 1 year ago

I'm seeing the same issue with the lack of smooth motion. Also, note that the new version of sample_text2video.py doesn't seem to support negative prompts in the sample_text2video() function. I just removed the argument altogether in order to get it working.

B34STW4RS commented 1 year ago

I'm actually not sure if the weights are working as intended across the board at this point either, but it could just be the poor quality of the models we have to work with.

kabachuha commented 1 year ago

Updated VideoCrafter: it now allows variable length, and you can control whether to add the soundtrack.