voletiv / mcvd-pytorch

Official implementation of MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation (https://arxiv.org/abs/2205.09853)
MIT License

UCF101 Unconditional Generation FVD Result (16 frames vs 20 frames) #26

Closed JunyaoHu closed 9 months ago

JunyaoHu commented 9 months ago

Hello. I want to confirm how the unconditional generation FVD is calculated. In your paper, you generate 16 frames.


https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/example_scripts/final/sampling_scripts.sh#L278-L286

And you calculate FVD between the 16-frame predicted result and the 20-frame original video, right?

For the 20-frame original video:

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1927-L1932

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1939-L1940

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1974-L1977

For the 16-frame predicted result:

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1915-L1916

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1979-L1982

Calculating the unconditional FVD result:

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L2264-L2269

AlexiaJM commented 9 months ago

Hi Junyao,

No, we still use 16 frames for real data.

See https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1458 and https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L115C28-L115C29.

JunyaoHu commented 9 months ago

@AlexiaJM Hello,

So, you calculate unconditional FVD between the 20-frame predicted result (pred20) and the 20-frame original video (cond4+real16), right?


When I run inference with your settings:

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/example_scripts/final/sampling_scripts.sh#L278-L286

I run only this script:

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/example_scripts/final/sampling_scripts.sh#L292-L296

It performs both the prediction task and the generation task:

https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L1404-L1405

In the video prediction task, FVD is calculated between (cond4+real16) and (cond4+pred16), which takes pred16/4 = 4 autoregressive steps. This matches my understanding. In the video generation task, FVD is calculated between (cond4+real16) and (pred20), which takes pred20/4 = 5 autoregressive steps.
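The frame accounting above can be written out as a small sketch. This is only an illustration of the arithmetic, not code from the repository: the function names `autoregressive_steps` and `fvd_clip_lengths` are hypothetical, and the defaults (4 conditioning frames, 16 predicted frames, a 4-frame model) are taken from the `data.num_frames_cond=4`, `sampling.num_frames_pred=16`, and `data.num_frames=4` settings in the config below.

```python
import math


def autoregressive_steps(frames_to_generate: int, frames_per_step: int) -> int:
    """Number of autoregressive passes needed to produce the requested frames."""
    return math.ceil(frames_to_generate / frames_per_step)


def fvd_clip_lengths(task: str, num_cond: int = 4, num_pred: int = 16,
                     num_frames: int = 4) -> dict:
    """Clip lengths compared by FVD, as described in this thread (hypothetical helper).

    task: "prediction" (cond + predicted frames) or "generation" (all frames generated).
    """
    # Real clips are the conditioning frames followed by the ground-truth future frames.
    real_len = num_cond + num_pred
    if task == "prediction":
        generated = num_pred             # only the future frames are sampled
        fake_len = num_cond + num_pred   # real cond frames are prepended to the prediction
    elif task == "generation":
        generated = num_cond + num_pred  # every frame is sampled
        fake_len = generated
    else:
        raise ValueError(f"unknown task: {task!r}")
    return {
        "real_len": real_len,
        "fake_len": fake_len,
        "steps": autoregressive_steps(generated, num_frames),
    }


if __name__ == "__main__":
    print(fvd_clip_lengths("prediction"))
    print(fvd_clip_lengths("generation"))
```

Under these assumptions both tasks compare 20-frame clips against 20-frame clips; only the number of sampled frames, and therefore the number of autoregressive steps, differs.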


My config output is as follows:

(PS: I only changed sampling.preds_per_test=1 and sampling.subsample=5 to get results faster.)

(EDM) ubuntu@ubuntu:~/zzc/code/mcvd-pytorch$ sh /home/ubuntu/zzc/code/mcvd-pytorch/example_scripts/final/sampling_scripts.sh
INFO - main.py - 2024-01-26 03:18:55,408 - Using device: cuda
INFO - main.py - 2024-01-26 03:18:55,409 - Namespace(config='configs/ucf101.yml', data_path='/home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5', seed=1234, exp='/home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm', comment='', verbose='info', resume_training=False, test=False, feats_dir='/home/ubuntu/zzc/code/mcvd-pytorch/datasets', stats_dir='/home/ubuntu/zzc/code/mcvd-pytorch/datasets', stats_download=False, fast_fid=False, fid_batch_size=1000, no_pr=False, fid_num_samples=None, pr_nn_k=None, sample=False, image_folder='images', final_only=True, end_ckpt=None, freq=None, no_ema=False, ni=True, interact=False, video_gen=True, video_folder='/home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm/video_samples/videos_900000_DDPM_100_nfp_16', subsample=None, ckpt=900000, config_mod=['data.prob_mask_cond=0.50', 'model.ngf=288', 'model.n_head_channels=288', 'data.num_frames=4', 'data.num_frames_cond=4', 'training.batch_size=32', 'sampling.batch_size=60', 'sampling.max_data_iter=1000', 'model.arch=unetmore', 'sampling.num_frames_pred=16', 'sampling.preds_per_test=1', 'sampling.subsample=5', 'model.version=DDPM'], start_at=0, command='python main.py --config configs/ucf101.yml --data_path /home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5 --exp /home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm --ni --config_mod data.prob_mask_cond=0.50 model.ngf=288 model.n_head_channels=288 data.num_frames=4 data.num_frames_cond=4 training.batch_size=32 sampling.batch_size=60 sampling.max_data_iter=1000 model.arch=unetmore sampling.num_frames_pred=16 sampling.preds_per_test=1 sampling.subsample=5 model.version=DDPM --ckpt 900000 --video_gen -v videos_900000_DDPM_100_nfp_16', log_path='/home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm/logs')
INFO - main.py - 2024-01-26 03:18:55,410 - Writing log file to /home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm/logs
INFO - main.py - 2024-01-26 03:18:55,410 - Exp instance id = 36017
INFO - main.py - 2024-01-26 03:18:55,410 - Exp comment = 
INFO - main.py - 2024-01-26 03:18:55,410 - Config =
...

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
INFO - main.py - 2024-01-26 03:18:55,419 - Args =
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ckpt: 900000
command: python main.py --config configs/ucf101.yml --data_path /home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5
  --exp /home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm
  --ni --config_mod data.prob_mask_cond=0.50 model.ngf=288 model.n_head_channels=288
  data.num_frames=4 data.num_frames_cond=4 training.batch_size=32 sampling.batch_size=60
  sampling.max_data_iter=1000 model.arch=unetmore sampling.num_frames_pred=16 sampling.preds_per_test=1
  sampling.subsample=5 model.version=DDPM --ckpt 900000 --video_gen -v videos_900000_DDPM_100_nfp_16
comment: ''
config: configs/ucf101.yml
config_mod:
- data.prob_mask_cond=0.50
- model.ngf=288
- model.n_head_channels=288
- data.num_frames=4
- data.num_frames_cond=4
- training.batch_size=32
- sampling.batch_size=60
- sampling.max_data_iter=1000
- model.arch=unetmore
- sampling.num_frames_pred=16
- sampling.preds_per_test=1
- sampling.subsample=5
- model.version=DDPM
...

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
INFO - ncsn_runner.py - 2024-01-26 03:18:59,159 - Loading ckpt /home/ubuntu/zzc/code/mcvd-pytorch/checkpoints/ucf10132_big288_4c4_pmask50_unetm/logs/checkpoint_900000.pt
Checking shard_lengths in ['/home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5/shard_0001.hdf5']
h5: Opening /home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5/shard_0001.hdf5... h5: paths 1 ; shard_lengths [13320] ; total 13320
Dataset length: 9624
Checking shard_lengths in ['/home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5/shard_0001.hdf5']
h5: Opening /home/ubuntu/zzc/data/video_prediction/UCF101/UCF101_h5/shard_0001.hdf5... h5: paths 1 ; shard_lengths [13320] ; total 13320
Dataset length: 256
Setting up Perceptual loss...
/home/ubuntu/anaconda3/envs/EDM/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/ubuntu/anaconda3/envs/EDM/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /home/ubuntu/zzc/code/mcvd-pytorch/models/weights/v0.1/alex.pth
...[net-lin [alex]] initialized
...Done

video_gen dataloader:   0%|          | 0/5 [00:00<?, ?it/s]
INFO - ncsn_runner.py - 2024-01-26 03:19:52,738 - (1) Video Pred
INFO - ncsn_runner.py - 2024-01-26 03:19:52,739 - PREDICTING 16 frames, using a 4 frame model conditioned on 4 frames, subsample=5, preds_per_test=1
DDPM: 1/5, grad_norm: 221.89865112304688, image_norm: 35.91960144042969, grad_mean_norm: 815.6091918945312 | 0/4 [00:00<?, ?it/s]
INFO - __init__.py - 2024-01-26 03:20:06,223 - DDPM: 1/5, grad_norm: 221.89865112304688, image_norm: 35.91960144042969, grad_mean_norm: 815.6091918945312

...

DDPM: 5/5, grad_norm: 378.7744445800781, image_norm: 79.4982681274414, grad_mean_norm: 815.8571166992188
INFO - __init__.py - 2024-01-26 03:20:21,519 - DDPM: 5/5, grad_norm: 378.7744445800781, image_norm: 79.4982681274414, grad_mean_norm: 815.8571166992188
Generating video frames: 100%|██████████| 4/4 [00:29<00:00,  7.30s/it]
INFO - ncsn_runner.py - 2024-01-26 03:27:10,659 - fvd1 True, fvd2 False, fvd3 True | 4/4 [00:29<00:00,  5.87s/it]
INFO - ncsn_runner.py - 2024-01-26 03:27:10,660 - (3) Video Gen - Uncond - FVD
INFO - ncsn_runner.py - 2024-01-26 03:27:10,660 - GENERATING (Uncond) 20 frames, using a 4 frame model (conditioned on 4 cond + 0 futr frames), subsample=5, preds_per_test=1
DDPM: 1/5, grad_norm: 221.8052520751953, image_norm: 35.507598876953125, grad_mean_norm: 817.7996826171875 | 0/5 [00:00<?, ?it/s]
INFO - __init__.py - 2024-01-26 03:27:11,371 - DDPM: 1/5, grad_norm: 221.8052520751953, image_norm: 35.507598876953125, grad_mean_norm: 817.7996826171875
DDPM: 2/5, grad_norm: 221.9393310546875, image_norm: 53.967655181884766, grad_mean_norm: 810.0634155273438
AlexiaJM commented 9 months ago

Yes, you have it right.

JunyaoHu commented 9 months ago

Thank you very much, this helps me a lot!