okankop / Efficient-3DCNNs

PyTorch Implementation of "Resource Efficient 3D Convolutional Neural Networks", codes and pretrained models.
MIT License

When we talk about accuracy, does it mean top-1 accuracy or top-5 accuracy? #32

Closed by fantasysee 3 years ago

fantasysee commented 3 years ago

Hi @okankop,

Thanks very much for sharing such a wonderful repo!

I am a little confused about the metric "video classification accuracy" in your paper. I don't know whether it means top-1 accuracy or top-5 accuracy.

The confusion comes from my experiment results based on your repo.

Results for MobileNetV1 on the UCF-101 dataset

Using the model pre-trained on Kinetics-600: Top1: 52.26%, Top5: 78.95%. Reported in your paper: 70.95%

{"modality": "RGB", "dataset": "ucf101", "n_classes": 600, "n_finetune_classes": 101, "sample_size": 112, "sample_duration": 16, "downsample": 2, "initial_scale": 1.0, "n_scales": 5, "scale_step": 0.84089641525, "train_crop": "random", "learning_rate": 0.1, "lr_steps": [40, 55, 65, 70, 200, 250], "momentum": 0.9, "dampening": 0.9, "weight_decay": 0.001, "mean_dataset": "activitynet", "no_mean_norm": false, "std_norm": false, "nesterov": false, "optimizer": "sgd", "lr_patience": 10, "batch_size": 64, "n_epochs": 250, "begin_epoch": 1, "n_val_samples": 1, "resume_path": "", "pretrain_path": "~/Documents/proj_3dcnn/Efficient-3DCNNs/Pretrained-Models/kinetics_mobilenet_1.0x_RGB_16_best.pth", "ft_portion": "last_layer", "no_train": false, "no_val": false, "test": false, "test_subset": "val", "scale_in_test": 1.0, "crop_position_in_test": "c", "no_softmax_in_test": false, "no_cuda": false, "n_threads": 16, "checkpoint": 1, "no_hflip": false, "norm_value": 1, "model": "mobilenet", "version": 1.1, "model_depth": 18, "resnet_shortcut": "B", "wide_resnet_k": 2, "resnext_cardinality": 32, "groups": 3, "width_mult": 1.0, "manual_seed": 1, "scales": [1.0, 0.84089641525, 0.7071067811803005, 0.5946035574934808, 0.4999999999911653], "arch": "mobilenet", "mean": [114.7748, 107.7354, 99.475], "std": [38.7568578, 37.88248729, 40.02898126]}

Training-from-scratch: Top1: 38.51%, Top5: 64.02%

{"modality": "RGB", "dataset": "ucf101", "n_classes": 101, "n_finetune_classes": 400, "sample_size": 112, "sample_duration": 16, "downsample": 2, "initial_scale": 1.0, "n_scales": 5, "scale_step": 0.84089641525, "train_crop": "random", "learning_rate": 0.1, "lr_steps": [40, 55, 65, 70, 200, 250], "momentum": 0.9, "dampening": 0.9, "weight_decay": 0.001, "mean_dataset": "activitynet", "no_mean_norm": false, "std_norm": false, "nesterov": false, "optimizer": "sgd", "lr_patience": 10, "batch_size": 64, "n_epochs": 250, "begin_epoch": 1, "n_val_samples": 1, "resume_path": "", "pretrain_path": "", "ft_portion": "complete", "no_train": false, "no_val": false, "test": false, "test_subset": "val", "scale_in_test": 1.0, "crop_position_in_test": "c", "no_softmax_in_test": false, "no_cuda": false, "n_threads": 16, "checkpoint": 1, "no_hflip": false, "norm_value": 1, "model": "mobilenet", "version": 1.1, "model_depth": 18, "resnet_shortcut": "B", "wide_resnet_k": 2, "resnext_cardinality": 32, "groups": 3, "width_mult": 1.0, "manual_seed": 1, "scales": [1.0, 0.84089641525, 0.7071067811803005, 0.5946035574934808, 0.4999999999911653], "arch": "mobilenet", "mean": [114.7748, 107.7354, 99.475], "std": [38.7568578, 37.88248729, 40.02898126]}

fantasysee commented 3 years ago

Previously I thought the accuracy meant Top5 accuracy, but there is a huge gap between the result I get by following the training method in your paper and the result reported in the paper.

If there is any training step I missed, please correct me.

Or maybe the reason I got a higher Top5 accuracy than the figure in your paper is that I trained for 250 more epochs starting from the pre-trained model?

Looking forward to your reply! Thanks in advance.

okankop commented 3 years ago

Hi @fantasysee, "video classification accuracy" refers to Top-1 accuracy. Most probably you are getting clip accuracy, not video accuracy. Once your training is finished, you need to calculate video accuracy as a separate step. To calculate video accuracy, the scores of the non-overlapping consecutive clips of each video are averaged. Check out the "Calculating Video Accuracy" part of the README.
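For readers following along, the difference between clip accuracy and video accuracy can be sketched roughly as below. This is only an illustration of the averaging idea described above, not the repository's own script; the function name `video_topk_accuracy`, the input layout, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict


def video_topk_accuracy(clip_scores, clip_video_ids, video_labels, k=1):
    """Average per-clip class scores within each video, then compute top-k accuracy.

    clip_scores:    list of per-class score lists, one entry per clip (hypothetical layout)
    clip_video_ids: id of the video each clip belongs to, in the same order
    video_labels:   dict mapping video id -> ground-truth class index
    """
    sums = {}
    counts = defaultdict(int)
    for scores, vid in zip(clip_scores, clip_video_ids):
        if vid not in sums:
            sums[vid] = [0.0] * len(scores)
        for c, s in enumerate(scores):
            sums[vid][c] += s
        counts[vid] += 1

    correct = 0
    for vid, total in sums.items():
        mean = [s / counts[vid] for s in total]
        # Indices of the k classes with the highest averaged score.
        topk = sorted(range(len(mean)), key=lambda c: mean[c], reverse=True)[:k]
        correct += video_labels[vid] in topk
    return correct / len(sums)


# Toy data (made up): two videos, three classes, five clips in total.
clip_scores = [
    [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.4, 0.1],  # clips of video "a"
    [0.1, 0.2, 0.7], [0.4, 0.35, 0.25],                 # clips of video "b"
]
clip_video_ids = ["a", "a", "a", "b", "b"]
video_labels = {"a": 0, "b": 2}

# Video accuracy: clip scores are averaged per video before taking the argmax.
video_acc = video_topk_accuracy(clip_scores, clip_video_ids, video_labels, k=1)

# Clip accuracy: treat every clip as its own one-clip "video".
clip_ids = list(range(len(clip_scores)))
clip_labels = {i: video_labels[v] for i, v in zip(clip_ids, clip_video_ids)}
clip_acc = video_topk_accuracy(clip_scores, clip_ids, clip_labels, k=1)

print(f"clip top-1: {clip_acc:.2f}, video top-1: {video_acc:.2f}")
```

In this toy case two of the five clips are misclassified on their own, but averaging within each video recovers the correct class for both videos, which is why video accuracy tends to come out higher than clip accuracy.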

fantasysee commented 3 years ago

Hi @okankop. Thank you very much for your warm and timely reply!

I have followed the "Calculating Video Accuracy" part of the README and calculated the video accuracy.

Nevertheless, the accuracy I measured is much lower than that in your paper.

Using the opts listed above, the pre-trained and fine-tuned model achieves 51.84%, while 70.95% is reported in your paper. The model trained from scratch achieves 39.55%.

Could you please tell me what might cause the drop in accuracy? Is there another step I missed? ;(

okankop commented 3 years ago

Going from clip accuracy to video accuracy usually increases the score by around 15-20% on the UCF dataset. Did you observe the same increase? If not, the video accuracy calculation may be wrong.
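As a quick sanity check on the numbers posted earlier in this thread, the gain can be computed directly. This is a minimal sketch: the helper names are hypothetical, and reading "15-20%" as percentage points is an assumption.

```python
def accuracy_gain(clip_acc, video_acc):
    """Gain in percentage points when moving from clip to video accuracy."""
    return video_acc - clip_acc


def gain_looks_suspicious(clip_acc, video_acc, min_expected_gain=15.0):
    """Flag a result whose gain falls short of the rule of thumb quoted above.

    min_expected_gain is an assumption: the lower end of the 15-20 range,
    interpreted as percentage points on UCF-101.
    """
    return accuracy_gain(clip_acc, video_acc) < min_expected_gain


# Numbers from this thread: Top1 52.26% during training, 51.84% video accuracy.
gain = accuracy_gain(52.26, 51.84)
suspicious = gain_looks_suspicious(52.26, 51.84)
print(f"gain: {gain:+.2f} points, suspicious: {suspicious}")
```

A gain near zero, as here, suggests the video-level averaging may not actually be taking place.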

fantasysee commented 3 years ago

No. My video accuracy is about the same as my clip accuracy ;( Thank you very much for your tip! It helps a lot!!!