mever-team / distill-and-select

Authors' official PyTorch implementation of "DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval" [IJCV 2022]
Apache License 2.0

[Questions about code evaluation results] #7

Closed: glee1228 closed this issue 2 years ago

glee1228 commented 2 years ago

Hi, thanks for your work!

The results I obtained differ from the evaluation results reported in the paper, but I don't know which part I'm missing.

Also, could you share the original videos of the FIVR-200K and DnS-100K datasets with me?

My training command is as follows.

python train_student.py --student_type coarse-grained --experiment_path experiments/dns_students --trainset_hdf5 /mldisk/nfs_shared_/dh/datasets/dns_100k.hdf5

My evaluation command is as follows.

python evaluation_student.py --student_path /workspace/distill-and-select/experiments/dns_students/model_cg_student.pth --dataset FIVR-5K --dataset_hdf5 /mldisk/nfs_shared_/dh/datasets/fivr_200k.hdf5

For the parameters of the training and evaluation scripts, I used the provided code as it was (i.e., the default values).
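
For reference, the full list of parameters and their default values can be printed with the standard help flag (this assumes the scripts use argparse, as is typical for PyTorch training scripts):

python train_student.py --help
python evaluation_student.py --help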

The performance of the coarse-grained student model reported in the paper is:

===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5000 videos
----------------
DSVR mAP: 0.634
CSVR mAP: 0.647
ISVR mAP: 0.608

The performance of my coarse-grained student model is:

===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5000 videos
----------------
DSVR mAP: 0.5735
CSVR mAP: 0.5920
ISVR mAP: 0.5579

The output below is from my run.

python train_student.py --student_type coarse-grained --experiment_path experiments/dns_students --trainset_hdf5 /mldisk/nfs_shared_/dh/datasets/dns_100k.hdf5

epoch 19: 100%|█| 688/688 [05:20<00:00,  2.14iter/s, total_loss=0.090 (0.064), distillat
epoch 20: 100%|█| 688/688 [05:22<00:00,  2.14iter/s, total_loss=0.067 (0.064), distillat
epoch 21: 100%|█| 688/688 [05:20<00:00,  2.14iter/s, total_loss=0.070 (0.064), distillat
epoch 22: 100%|█| 688/688 [05:22<00:00,  2.14iter/s, total_loss=0.066 (0.063), distillat
epoch 23: 100%|█| 688/688 [05:23<00:00,  2.13iter/s, total_loss=0.066 (0.063), distillat
epoch 24: 100%|█| 688/688 [05:20<00:00,  2.14iter/s, total_loss=0.050 (0.062), distillat
epoch 25: 100%|█| 688/688 [05:22<00:00,  2.13iter/s, total_loss=0.058 (0.061), distillat
epoch 26: 100%|█| 688/688 [05:23<00:00,  2.13iter/s, total_loss=0.067 (0.061), distillat
epoch 27: 100%|█| 688/688 [05:23<00:00,  2.13iter/s, total_loss=0.054 (0.061), distillat
epoch 28: 100%|█| 688/688 [05:23<00:00,  2.13iter/s, total_loss=0.062 (0.061), distillat
epoch 29: 100%|█| 688/688 [05:24<00:00,  2.12iter/s, total_loss=0.059 (0.060), distillat
epoch 30: 100%|█| 688/688 [05:23<00:00,  2.12iter/s, total_loss=0.086 (0.060), distillat
epoch 31: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.056 (0.060), distillat
epoch 32: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.056 (0.059), distillat
epoch 33: 100%|█| 688/688 [05:22<00:00,  2.13iter/s, total_loss=0.058 (0.059), distillat
epoch 34: 100%|█| 688/688 [05:24<00:00,  2.12iter/s, total_loss=0.075 (0.059), distillat
epoch 35: 100%|█| 688/688 [05:24<00:00,  2.12iter/s, total_loss=0.063 (0.059), distillat

...

epoch 250: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.037 (0.044), distill
epoch 251: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.045 (0.044), distill
epoch 252: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.038 (0.044), distill
epoch 253: 100%|█| 688/688 [05:23<00:00,  2.13iter/s, total_loss=0.038 (0.044), distill
epoch 254: 100%|█| 688/688 [05:23<00:00,  2.13iter/s, total_loss=0.069 (0.044), distill
epoch 255: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.052 (0.044), distill
epoch 256: 100%|█| 688/688 [05:22<00:00,  2.14iter/s, total_loss=0.041 (0.044), distill
epoch 257: 100%|█| 688/688 [05:22<00:00,  2.13iter/s, total_loss=0.043 (0.044), distill
epoch 258: 100%|█| 688/688 [05:20<00:00,  2.14iter/s, total_loss=0.071 (0.044), distill
epoch 259: 100%|█| 688/688 [05:18<00:00,  2.16iter/s, total_loss=0.055 (0.044), distill
epoch 260: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.049 (0.044), distill
epoch 261: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.037 (0.043), distill
epoch 262: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.047 (0.044), distill
epoch 263: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.055 (0.044), distill
epoch 264: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.034 (0.043), distill
epoch 265: 100%|█| 688/688 [05:18<00:00,  2.16iter/s, total_loss=0.042 (0.044), distill
epoch 266: 100%|█| 688/688 [05:19<00:00,  2.16iter/s, total_loss=0.035 (0.044), distill
epoch 267: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.041 (0.043), distill
epoch 268: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.040 (0.044), distill
epoch 269: 100%|█| 688/688 [05:20<00:00,  2.14iter/s, total_loss=0.051 (0.044), distill
epoch 270: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.050 (0.044), distill
epoch 271: 100%|█| 688/688 [05:17<00:00,  2.17iter/s, total_loss=0.039 (0.044), distill
epoch 272: 100%|█| 688/688 [05:18<00:00,  2.16iter/s, total_loss=0.038 (0.043), distill
epoch 273: 100%|█| 688/688 [05:16<00:00,  2.17iter/s, total_loss=0.041 (0.043), distill
epoch 274: 100%|█| 688/688 [05:17<00:00,  2.16iter/s, total_loss=0.034 (0.043), distill
epoch 275: 100%|█| 688/688 [05:16<00:00,  2.18iter/s, total_loss=0.056 (0.043), distill
epoch 276: 100%|█| 688/688 [05:17<00:00,  2.17iter/s, total_loss=0.040 (0.043), distill
epoch 277: 100%|█| 688/688 [05:17<00:00,  2.17iter/s, total_loss=0.051 (0.043), distill
epoch 278: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.038 (0.043), distill
epoch 279: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.028 (0.043), distill
epoch 280: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.039 (0.043), distill
epoch 281: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.035 (0.043), distill
epoch 282: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.033 (0.043), distill
epoch 283: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.039 (0.043), distill
epoch 284: 100%|█| 688/688 [05:22<00:00,  2.14iter/s, total_loss=0.034 (0.043), distill
epoch 285: 100%|█| 688/688 [05:18<00:00,  2.16iter/s, total_loss=0.041 (0.043), distill
epoch 286: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.058 (0.043), distill
epoch 287: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.038 (0.043), distill
epoch 288: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.036 (0.043), distill
epoch 289: 100%|█| 688/688 [05:17<00:00,  2.17iter/s, total_loss=0.061 (0.043), distill
epoch 290: 100%|█| 688/688 [05:19<00:00,  2.16iter/s, total_loss=0.037 (0.043), distill
epoch 291: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.041 (0.043), distill
epoch 292: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.039 (0.043), distill
epoch 293: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.048 (0.043), distill
epoch 294: 100%|█| 688/688 [05:18<00:00,  2.16iter/s, total_loss=0.025 (0.043), distill
epoch 295: 100%|█| 688/688 [05:19<00:00,  2.15iter/s, total_loss=0.043 (0.043), distill
epoch 296: 100%|█| 688/688 [05:18<00:00,  2.16iter/s, total_loss=0.039 (0.043), distill
epoch 297: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.056 (0.043), distill
epoch 298: 100%|█| 688/688 [05:20<00:00,  2.15iter/s, total_loss=0.036 (0.043), distill
epoch 299: 100%|█| 688/688 [05:21<00:00,  2.14iter/s, total_loss=0.042 (0.043), distill

root@5b101930459a:/workspace/distill-and-select# python evaluation_student.py --student_path experiments/DnS_students/model_fg_att_student.pth --dataset FIVR-5K --dataset_hdf5 /mldisk/nfs_shared_/dh/datasets/fivr_200k.hdf5

> Loading network
CoarseGrainedStudent(
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (netvlad): NetVLAD(
    (conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (reduction_layer): Linear(in_features=32768, out_features=1024, bias=False)
    (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
)

> Extract features of the query videos
100%|████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.34s/it]

> Extract features of the target videos
100%|████████████████████████████████████████████████| 157/157 [01:57<00:00,  1.34it/s]

> Calculate query-target similarities

> Evaluation on FIVR
===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5000 videos
----------------
DSVR mAP: 0.5735
CSVR mAP: 0.5920
ISVR mAP: 0.5579
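
For context on the numbers above: the evaluation ranks the 5,000 database videos per query by similarity and reports mean average precision (mAP); the DSVR/CSVR/ISVR tasks differ only in which database videos count as relevant. Below is a minimal sketch of this metric, assuming dot-product similarities over one embedding per video (illustrative only, not the repo's evaluation code):

import numpy as np

def average_precision(relevance):
    # AP for one query: relevance is 0/1 over the database in ranked order.
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = relevance.cumsum() / np.arange(1, len(relevance) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

def retrieval_map(query_embs, db_embs, relevant_sets):
    # mAP over queries: rank the database by similarity for each query.
    sims = query_embs @ db_embs.T                  # (num_queries, num_db)
    aps = []
    for q, row in enumerate(sims):
        order = np.argsort(-row)                   # most similar first
        relevance = [1 if j in relevant_sets[q] else 0 for j in order]
        aps.append(average_precision(relevance))
    return float(np.mean(aps))

# e.g. one query whose 1st- and 3rd-ranked videos are relevant:
print(average_precision([1, 0, 1, 0]))             # (1/1 + 2/3) / 2 = 0.833
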
gkordo commented 2 years ago

Hi @glee1228. Thank you very much for your feedback. The major discrepancy between your run and ours that I can spot is that we used a learning rate of 1e-5 to train the coarse-grained student, whereas you used the default value (i.e., 1e-4, which is used to train the fine-grained student). According to our empirical findings, the learning rate has a significant impact on performance, so this might be the reason for the large difference. I have also updated the README.md to indicate the correct learning rate.

Hence, could you please train the network with a 1e-5 learning rate? If there is still such a significant performance difference, please let me know so I can look further into it.
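
Concretely, with the paths from your run, the corrected invocation would be the same command with the learning rate set explicitly:

python train_student.py --student_type coarse-grained --learning_rate 1e-5 --experiment_path experiments/dns_students --trainset_hdf5 /mldisk/nfs_shared_/dh/datasets/dns_100k.hdf5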

gkordo commented 2 years ago

Hi again. I have revisited this issue with a clean install on a new machine. A couple more things were different from the run that generated the provided pretrained weights. More precisely, the attention layer was activated with --attention true, and the teacher used for training was fg_att_student_iter2. To this end, I ran the following script:

python train_student.py --student_type coarse-grained --learning_rate 1e-5 --attention true --teacher fg_att_student_iter2 --experiment_path ~/experiments/DnS_method --trainset_hdf5 ~/features/dns_100k.hdf5

It achieved the following results:

===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5000 videos
----------------
DSVR mAP: 0.6364
CSVR mAP: 0.6482
ISVR mAP: 0.6062

Also, since the performance on FIVR-5K is quite volatile, I have added an option to the training script for validation on this dataset. See the last bullet in the corresponding README.md section for more details.