Low IS on UCF101 - Githubissues

iva-mzsun commented 1 year ago

Hi, I find the IS score of UCF101 real videos in Table 5 is much higher than that I obtained, but I fail to find where is wrong.

I extract frames from UCF101 videos with 32 FPS, then center crop and resize each frame to 256x256 resolution. Corresponding cmd and results is:

python src/scripts/calc_metrics_for_dataset.py \
--fake_data_path datasets/UCF101/frames/trainval_32fps \
--mirror 1 --gpus 1 --resolution 256 --metrics is50k --verbose 1 --use_cache 0

python src/scripts/calc_metrics_for_dataset.py --fake_data_path datasets/UCF101/frames/trainval_32fps --mirror 1 --gpus 1 --resolution 256 --metrics is50k --verbose 1 --use_cache 0
Real data options:                                              
{'class_name': 'training.dataset.VideoFramesFolderDataset', 'path': None, 'cfg': {'max_num_frames': 10000}, 'xflip': True, 'resolution': 256, 'use_labels':
False}                                                          
Fake data options:                                              
{'class_name': 'training.dataset.VideoFramesFolderDataset', 'path': 'datasets/UCF101/f
rames/trainval_32fps', 'cfg': {'max_num_frames': 10000}, 'xflip': False, 'resolution': 256, 'use_labels': False}
Launching processes...                                          
Calculating is50k...                                            
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metrics/inception-2015-12-05.pt ... done
dataset features    items 1024    time 1m 04s       ms/item 62.22
dataset features    items 2048    time 1m 11s       ms/item 6.88
dataset features    items 3072    time 1m 16s       ms/item 5.50
dataset features    items 4096    time 1m 22s       ms/item 5.49
dataset features    items 5120    time 1m 27s       ms/item 5.26
dataset features    items 6144    time 1m 32s       ms/item 4.61
dataset features    items 7168    time 1m 37s       ms/item 4.46
dataset features    items 8192    time 1m 43s       ms/item 5.67
dataset features    items 9216    time 1m 48s       ms/item 5.49
dataset features    items 10240   time 1m 53s       ms/item 4.38
dataset features    items 11264   time 1m 59s       ms/item 6.42                                                                                            
dataset features    items 12288   time 2m 05s       ms/item 6.00          
dataset features    items 13312   time 2m 10s       ms/item 5.05
dataset features    items 14336   time 2m 14s       ms/item 3.76
dataset features    items 15360   time 2m 19s       ms/item 4.52
dataset features    items 16384   time 2m 24s       ms/item 4.45
dataset features    items 17408   time 2m 29s       ms/item 4.91
dataset features    items 18432   time 2m 33s       ms/item 4.37
dataset features    items 19456   time 2m 38s       ms/item 5.06
dataset features    items 20480   time 2m 43s       ms/item 5.12
dataset features    items 21504   time 2m 48s       ms/item 4.27
dataset features    items 22528   time 2m 52s       ms/item 4.02
dataset features    items 23552   time 2m 57s       ms/item 4.87
dataset features    items 24576   time 3m 02s       ms/item 4.82
dataset features    items 25600   time 3m 07s       ms/item 5.42
dataset features    items 26624   time 3m 12s       ms/item 4.83
dataset features    items 27648   time 3m 18s       ms/item 5.08
dataset features    items 28672   time 3m 23s       ms/item 5.35
dataset features    items 29696   time 3m 28s       ms/item 4.51
dataset features    items 30720   time 3m 33s       ms/item 5.29
dataset features    items 31744   time 3m 38s       ms/item 4.55
dataset features    items 32768   time 3m 42s       ms/item 4.34
dataset features    items 33792   time 3m 46s       ms/item 4.09
dataset features    items 34816   time 3m 51s       ms/item 4.48
dataset features    items 35840   time 3m 55s       ms/item 4.07
dataset features    items 36864   time 4m 00s       ms/item 4.86
dataset features    items 37888   time 4m 05s       ms/item 4.76
dataset features    items 38912   time 4m 10s       ms/item 5.04
dataset features    items 39936   time 4m 15s       ms/item 4.49
dataset features    items 40960   time 4m 19s       ms/item 4.47
dataset features    items 41984   time 4m 23s       ms/item 3.90
dataset features    items 43008   time 4m 28s       ms/item 4.29
dataset features    items 44032   time 4m 32s       ms/item 4.49
dataset features    items 45056   time 4m 37s       ms/item 4.25
dataset features    items 46080   time 4m 42s       ms/item 5.23
dataset features    items 47104   time 4m 46s       ms/item 4.28
dataset features    items 48128   time 4m 51s       ms/item 4.41
dataset features    items 49152   time 4m 55s       ms/item 3.67
dataset features    items 50000   time 4m 59s       ms/item 5.52
{"results": {"is50k_mean": 2.733875036239624, "is50k_std": 0.36327630281448364}, "metric": "is50k", "total_time": 414.2170798778534, "total_time_str": "6m 5
4s", "num_gpus": 1, "snapshot_pkl": null, "timestamp": 1665990497.9141998}

iva-mzsun commented 1 year ago

When using isv, it is still far lower than expectation.

python src/scripts/calc_metrics_for_dataset.py \
--fake_data_path datasets/UCF101/frames/trainval_32fps \
--mirror 1 --gpus 1 --resolution 256 --metrics isv2048_ucf --verbose 1 --use_cache 0

Real data options:                                                                                                                     
{'class_name': 'training.dataset.VideoFramesFolderDataset', 'path': None, 'cfg': {'max_num_frames': 10000}, 'xflip': True, 'resolution'
: 256, 'use_labels': False}                                                                                                           
Fake data options:                                                                                                                    
{'class_name': 'training.dataset.VideoFramesFolderDataset', 'path': 'datasets/UCF101/frames/trainval_32fps', 'cfg': {'max_num_frames': 10000}, 'xflip': False, 'resolution': 256, 'use_labels': False} 
Launching processes...                                                                                                                
Calculating isv2048_ucf...                                                                                                                                                                        
dataset features    items 1024    time 2m 27s       ms/item 143.37                                                                    
dataset features    items 2048    time 4m 10s       ms/item 100.84                                                                    {"results": {"isv2048_ucf_mean": 16.65230369567871, "isv2048_ucf_std": 0.7203830480575562}, "metric": "isv2048_ucf", "total_time": 379.
13585901260376, "total_time_str": "6m 19s", "num_gpus": 1, "snapshot_pkl": null, "timestamp": 1665998206.2202895}

liangbingzhao commented 1 year ago

I got same issue. I used isv and got mean:16.668, std: 0.4938

liangbingzhao commented 1 year ago

@anonymous202203 Hi, have you figured it out?

iva-mzsun commented 1 year ago

Not with this repository. I adopt another implementation of FVD (https://github.com/pfnet-research/tgan2), which obtains a reasonable IS score for me.

liangbingzhao commented 1 year ago

@anonymous202203 do you have its pretrained model on ucf101? Their implementation is a little bit confused if without pretrained model config

liangbingzhao commented 1 year ago

@anonymous202203 can you share your parameters of tgan2? I test it on the original UCF-101, the IS only got around 30, which is supposed to be 60.

universome commented 1 year ago

Hi @anonymous202203 , ok, this is serious. Do you think it is possible to share your version of the UCF dataset with me (my email is iskorokhodov@gmail.com)? Our ISV implementation should be identical to the one from TGANv2 and I was checking all the activations to verify that it's indeed true. In our case, it was giving scores of ~90 for real data as far as I remember (UPD: yeah, I just took a look at Table 5, it is 97).

Also, IS which you measured in your first comment is an image-based metric using an ImageNet-pretrained model, so it's not a surprise that it shows low values.

P.S. I apologize for not responding in time

universome commented 1 year ago

@anonymous202203 @martinriven

Ah, I think I might understand the issue: since you pretend that you evaluate on fake data, you are using just 2048 videos out of 10-11k ones, this is why a lot of classes are ommited, which makes IS being very unhappy. You should change num_gen argument here to be (at least approximately) equal to the number of images in your UCF. Otherwise, many classes are not covered an IS is low.

We used just 2048 videos for fake data to be comparable with prior work. Also, if the classes are randomly (and thus evenly) distributed in those 2048 videos, then Inception Score is not that bad. But when you use fake data, then just the first 2048 videos are taken from the dataset because the dataloader is run with with shuffle=False during evaluation.

liangbingzhao commented 1 year ago

@universome yeah, you are right. when setting to 13320, it got almost 60, thx.

universome commented 1 year ago

@martinriven 60 is still too low. I've just recomputed the metric with num_gen=13320 and got ISV=84.13. I suspect that there could be an issue of how you pre-process the UCF dataset.

liangbingzhao commented 1 year ago

@universome Well, from LDVD GAN, if set resolution to 128, the IS should be around 80-90. Why u got 84 when set resolution to 256? Shouldn't it be bigger? b0f642a63194187185d05655e38a2ef

universome commented 1 year ago

@martinriven the underlying C3D model resizes all the input videos to the 112x112 resolution, so it would be producing almost identical results for anything higher than 112x112 (depending on the downsampling scheme you use)

liangbingzhao commented 1 year ago

@universome you are right, there should be no difference when resolution higher than 112. I wonder how you process the data? I chose central 32 frame of the each video(due to the training scheme), center crop and resize to 128 and 256 resolution. I could only get IS of 60.

universome commented 1 year ago

@liangbingzhao we use the full videos during training. For the metric calculation above, since it is assumed that those videos are generated by the generator (i.e. we pass the real data via fake_data_path), they are assumed to be just 16-frames long and we simply extract the first 16 frames from each video. When I tried running the above script with extracting random 16 consecutive frames from each video, ISV was ~85. Do you store videos as JPEG images (and if so, which JPEG quality did you use while converting MP4 into JPEG)?

liangbingzhao commented 1 year ago

I first crop UCF to 240*240, and store as MP4. Then I use your script to convert UCF videos to JPEG. I tried store 128 and 256 images, both only got ISV 60.

universome commented 1 year ago

@liangbingzhao As far as I remember, we simply downloaded the original UCF, videos and then preprocessed into a collection of JPG images with our script. Can it be the case that you accidentally decreased the video quality (e.g., by using a too severe compression) while converting to MP4?

universome / stylegan-v

Low IS on UCF101 #26