sjenni / temporal-ssl

Video Representation Learning by Recognizing Temporal Transformations. In ECCV, 2020.
https://sjenni.github.io/temporal-ssl/
GNU General Public License v3.0

Cannot reproduce linear evaluation performance on UCF-101 #4

Closed: KT27-A closed this issue 3 years ago

KT27-A commented 3 years ago

Dear friend, thank you very much for your work; I really learned a lot from it. It is impressive that after training only on speed prediction for 50 epochs, the model reaches 49.3% accuracy on UCF-101 with linear evaluation. I have been trying to reproduce this performance in PyTorch following your code, but I only get about 15% accuracy on UCF-101 with linear evaluation. Could you please give me some advice on how to achieve the reported performance? I have checked many times that I followed your code, but I may have neglected something important. Thank you very much.
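For concreteness, my linear-evaluation recipe looks roughly like this (a minimal PyTorch sketch; `C3DBackbone`, `train_loader`, and the 8192 feature size are placeholders for my own code):

```python
import torch
import torch.nn as nn

# Hypothetical backbone class standing in for my own C3D implementation.
backbone = C3DBackbone()
backbone.load_state_dict(torch.load('speed_pretrain.pth'))  # self-supervised weights
for p in backbone.parameters():
    p.requires_grad = False  # freeze all conv layers
backbone.eval()

linear = nn.Linear(8192, 101)  # 101 UCF-101 classes; 8192 = my flattened pool5 size
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for clips, labels in train_loader:  # train_loader: my UCF-101 clip loader
    with torch.no_grad():
        feats = backbone(clips).flatten(1)  # features stay fixed
    loss = criterion(linear(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```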

sjenni commented 3 years ago

Hi, it is difficult to tell what the issue is since there are many factors involved. Did you check what performance supervised pre-training on UCF-101 gets (approx. 60% in my case)? Another test would be to see what random initialization achieves. How is the performance on the pre-training task itself?

KT27-A commented 3 years ago

Thank you very much for your quick response. I trained from scratch and got 55% accuracy on UCF-101, which seems OK. Could you please tell me the details of the random initialization of the conv and FC layers? Is the random initialization of the FC layer the more important one? I used a normal distribution with mean 0 and std 1/np.sqrt(number of input features) on the FC layers.
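Concretely, my FC initialization looks like this (a small PyTorch sketch of the scheme above; `model` is a placeholder for my network):

```python
import numpy as np
import torch.nn as nn

def init_fc(m):
    # Normal(0, 1/sqrt(fan_in)) on fully connected layers, as described above.
    if isinstance(m, nn.Linear):
        std = 1.0 / np.sqrt(m.in_features)
        nn.init.normal_(m.weight, mean=0.0, std=std)
        nn.init.zeros_(m.bias)

model.apply(init_fc)  # conv layers keep their default init
```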

KT27-A commented 3 years ago

Also, could you please tell me what accuracy you got when training only on speed prediction? I got ~57% for it. Thanks.

sjenni commented 3 years ago

> Thank you very much for your quick response. I trained from scratch and got 55% accuracy on UCF-101, which seems OK. Could you please tell me the details of the random initialization of the conv and FC layers? Is the random initialization of the FC layer the more important one? I used a normal distribution with mean 0 and std 1/np.sqrt(number of input features) on the FC layers.

Hi, 55% with supervised training sounds reasonable (although a bit lower than the 60% I got). I used the default initialization of TF (glorot-uniform, I believe). Do you mean the training accuracy on the speed prediction task? I believe it was around 60% with 4 speed classes.
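In PyTorch, I believe TF's default glorot-uniform corresponds to `nn.init.xavier_uniform_`, so something like this rough sketch (`model` being your network):

```python
import torch.nn as nn

def glorot_init(m):
    # Mirror TF's default glorot-uniform (= Xavier uniform) initializer,
    # with biases set to zero as in TF's defaults.
    if isinstance(m, (nn.Conv3d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model.apply(glorot_init)
```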

KT27-A commented 3 years ago

Hi, thanks for your response. I need to check my code further. I found a strange thing when I trained with your source code. To pre-train only on the speed prediction task, I set --transform 'orig' and deleted skip_label = tf.concat([skip_label, skip_label], 0) at line 41 in train/VideoSSLTrainer.py. Finally, I got 65% in evaluation, which is far higher than what you reported in the paper (49.3%). Although my training epochs and batch size differ from those in the paper, the gap seems strange. Did I misunderstand something? Thank you, man.
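For reference, on my PyTorch side I build the 4-class speed samples roughly like this (a sketch; the sampling details are my own guess at the paper's setup):

```python
import numpy as np

def sample_speed_clip(video, clip_len=16):
    # video: array of frames [T, H, W, C]; label = sampling-rate class.
    # Assumes the video has at least clip_len * 8 frames.
    label = np.random.randint(4)   # speeds 1x, 2x, 4x, 8x
    stride = 2 ** label
    max_start = len(video) - clip_len * stride
    start = np.random.randint(max(1, max_start))
    idx = start + stride * np.arange(clip_len)
    return video[idx], label
```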

sjenni commented 3 years ago

Hi, if you used train_test_C3D.py for this, then the setup is different from Table 1 in the paper. That script does full fine-tuning of all the layers, following the setup of Table 2. Otherwise, your steps to train only on speed prediction sound correct. To keep the conv layers fixed, you could specify trainscopes=','.join(['{}/fc{}'.format(net_scope, i+1) for i in range(3)]) in line 50, as sketched below.
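That is, something along these lines (a TF1-style sketch of scope-restricted training; `net_scope` and `loss` are illustrative placeholders):

```python
import tensorflow as tf

# Collect only the fc-layer variables and pass them to the optimizer,
# so the conv layers stay frozen during linear evaluation.
trainscopes = ','.join('{}/fc{}'.format(net_scope, i + 1) for i in range(3))
var_list = []
for scope in trainscopes.split(','):
    var_list += tf.compat.v1.trainable_variables(scope=scope)

optimizer = tf.compat.v1.train.MomentumOptimizer(0.01, 0.9)
train_op = optimizer.minimize(loss, var_list=var_list)  # loss: classification loss
```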

KT27-A commented 3 years ago

Hi Jenni, thanks for your patience. I nearly reproduced the performance on the PyTorch platform with different learning-rate settings on the speed prediction task. Now I am still confused about one thing: why did you use net = tf.pad(net, [[0, 0], [1, 1], [1, 1], [1, 1], [0, 0]]) before the 5th conv layer? I think directly using conv3d with 'SAME' padding, as in the former conv layers, does the same thing.
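For stride 1 and a 3x3x3 kernel, the two should indeed be numerically identical; this is a small TF check I used to convince myself (shapes are arbitrary):

```python
import tensorflow as tf

x = tf.random.normal([1, 8, 16, 16, 4])  # [N, D, H, W, C]
w = tf.random.normal([3, 3, 3, 4, 8])    # 3x3x3 kernel, 4 -> 8 channels

# Explicit zero-pad + VALID conv, as before the 5th conv layer in the repo.
padded = tf.pad(x, [[0, 0], [1, 1], [1, 1], [1, 1], [0, 0]])
y_pad = tf.nn.conv3d(padded, w, strides=[1] * 5, padding='VALID')

# Plain SAME conv, as in the earlier conv layers.
y_same = tf.nn.conv3d(x, w, strides=[1] * 5, padding='SAME')

print(tf.reduce_max(tf.abs(y_pad - y_same)).numpy())  # ~0.0
```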