showlab / Tune-A-Video

[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
https://tuneavideo.github.io
Apache License 2.0
4.15k stars, 377 forks

what's the meaning of 'n_sample_frames' and 'video_length', should they be the same? #58

Closed: overfiter closed this issue 1 year ago

overfiter commented 1 year ago

To save VRAM, I set n_sample_frames=1 and video_length=12 (I don't know what they mean), and I got an error:

  File "E:\repo\Tune-A-Video\train_tuneavideo.py", line 374, in <module>
    main(**OmegaConf.load(args.config))
  File "E:\repo\Tune-A-Video\train_tuneavideo.py", line 339, in main
    sample = validation_pipeline(prompt, generator=generator, latents=ddim_inv_latent,
  File "C:\Users\coreyzhong\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "E:\repo\Tune-A-Video\tuneavideo\pipelines\pipeline_tuneavideo.py", line 356, in __call__
    latents = self.prepare_latents(
  File "E:\repo\Tune-A-Video\tuneavideo\pipelines\pipeline_tuneavideo.py", line 303, in prepare_latents
    raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
ValueError: Unexpected latents shape, got torch.Size([1, 4, 1, 64, 64]), expected (1, 4, 12, 64, 64)

zhangjiewu commented 1 year ago

n_sample_frames is the number of frames used for training the model. video_length is the number of frames used for inference (i.e., generating new videos).

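Here, they should be the same: the validation step passes the DDIM-inverted latents of the training frames to the pipeline, so the number of frames in those latents has to match video_length.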
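The shape mismatch in the traceback above can be reproduced without a GPU. The following is only a sketch, not the repository's code: the helper names `expected_latent_shape` and `check_frame_settings` are hypothetical, and the 512x512 resolution, 4 latent channels, and downscale factor of 8 are assumptions chosen to match the 64x64 latents reported in the error.

```python
def expected_latent_shape(batch_size, num_frames, height, width,
                          latent_channels=4, vae_scale_factor=8):
    """Return (batch, channels, frames, h, w) for a video latent tensor.

    latent_channels and vae_scale_factor are assumed values that match
    the (1, 4, F, 64, 64) shapes in the traceback for 512x512 input.
    """
    return (batch_size, latent_channels, num_frames,
            height // vae_scale_factor, width // vae_scale_factor)


def check_frame_settings(n_sample_frames, video_length, height=512, width=512):
    """Hypothetical pre-flight check for the two frame settings.

    The inverted latents come from the training clip (n_sample_frames),
    while the pipeline expects video_length frames at inference time.
    """
    got = expected_latent_shape(1, n_sample_frames, height, width)
    expected = expected_latent_shape(1, video_length, height, width)
    if got != expected:
        raise ValueError(
            f"Unexpected latents shape, got {got}, expected {expected}"
        )
    return expected


if __name__ == "__main__":
    # Matching settings pass the check.
    print(check_frame_settings(n_sample_frames=12, video_length=12))
    # The settings from this issue fail in the same way as the traceback.
    try:
        check_frame_settings(n_sample_frames=1, video_length=12)
    except ValueError as err:
        print(err)
```

Running a check like this before training would surface the mismatch right away instead of at the first validation step.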

overfiter commented 1 year ago

Thanks for your reply! On my RTX 3060 (12 GB), with xformers, I can only use 3 frames for training and inference (each additional frame needs about 1 GB of VRAM). Is a large amount of VRAM necessary for generating long videos?

zhangjiewu commented 1 year ago

I think 12GB of VRAM can handle an 8-frame video with xformers. This colab demo runs an 8-frame video on a Tesla T4 (15GB). You may want to double-check that your xformers is working. Simply adding more frames to a video will not result in a proportional increase in VRAM; a V100 (24GB) can process 32-frame videos.