Closed v-wewei closed 5 years ago
Hi,
Thanks for asking. This is the I3D backbone I used: https://github.com/piergiaj/pytorch-i3d
And this is how I sample segments and extract their features: https://github.com/noureldien/timeception/blob/master/datasets/charades.py#L719 https://github.com/noureldien/timeception/blob/master/datasets/charades.py#L311
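In essence, the sampling draws fresh segments from each video on every epoch. A minimal sketch of that idea (the function and parameter names are mine, not the exact code in charades.py, and it assumes each chunk is at least one segment long):

```python
import random

def sample_segments(n_frames, n_segments, segment_len):
    """Draw one random segment from each uniform chunk of the video.

    Hypothetical sketch of per-epoch random segment sampling; the exact
    logic lives in datasets/charades.py. Assumes n_frames / n_segments
    is at least segment_len.
    """
    chunk = n_frames / float(n_segments)
    segments = []
    for i in range(n_segments):
        lo = int(i * chunk)
        hi = int((i + 1) * chunk) - segment_len
        start = random.randint(lo, hi) if hi > lo else lo
        segments.append(list(range(start, start + segment_len)))
    return segments

# 128-frame video, 8 segments of 8 frames each; calling this again
# next epoch yields different segments, so I3D sees new inputs.
segments = sample_segments(128, 8, 8)
```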
For the best results, in each epoch I sample new segments and extract their features using I3D. During testing, I average the scores of 10 random crops. Previous works such as Non-Local even test on 30 crops.
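The multi-crop testing boils down to averaging the per-crop class scores into one video-level score. A minimal sketch, assuming the per-crop predictions are already collected into an array (names are mine, not from the repo):

```python
import numpy as np

def average_crop_scores(crop_scores):
    """Average class scores over test-time crops (e.g. 10 random crops).

    crop_scores: (n_crops, n_classes) array of per-crop predictions.
    Returns one (n_classes,) video-level score vector.
    """
    return np.asarray(crop_scores, dtype=np.float64).mean(axis=0)

# Two toy crops over three classes.
video_score = average_crop_scores([[0.2, 0.8, 0.1],
                                   [0.4, 0.6, 0.3]])
# video_score -> [0.3, 0.7, 0.2]
```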
In fact, I see you sample frames according to the action rather than the video. Could you tell me why? Charades is a multi-label task, right?
I am the author of only Timeception. In my code (this repository), I sample from the video. I don't have access to temporal annotation of the actions in the video during training.
As for the backbone CNN (either I3D or ResNet-3D), I didn't train or fine-tune them on Charades. Please consult their authors. Also, please consult the authors of Non-local, Video-space-time-graph, Feature-banks, and SlowFast networks, and ask them why they use temporal annotation when training the backbone CNN on Charades.
For example, look at this paper: Long-term feature banks, page 11 "Appendix H. Charades Training Schedule". https://arxiv.org/abs/1812.05038. They said: "We train the models to predict the ‘clip-level’ labels, i.e., the union of the frame labels that fall into the clip’s temporal range."
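The clip-level labeling they describe, i.e. the union of frame labels falling inside the clip's temporal range, can be sketched as follows (the dict representation of the temporal annotation is my assumption, not their code):

```python
def clip_labels(frame_labels, clip_start, clip_end):
    """Union of per-frame label sets for frames in [clip_start, clip_end).

    frame_labels: dict mapping frame index -> set of class ids active at
    that frame (hypothetical encoding of Charades temporal annotation).
    """
    labels = set()
    for t in range(clip_start, clip_end):
        labels |= frame_labels.get(t, set())
    return labels

# Toy annotation: action 5 spans frames 0-3, action 9 spans frames 2-6.
ann = {t: {5} for t in range(4)}
for t in range(2, 7):
    ann.setdefault(t, set()).add(9)

# A clip over frames 0-3 picks up both actions via the union.
labels = clip_labels(ann, 0, 4)  # -> {5, 9}
```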
So, please go and ask them why they did this. It's unfair to access temporal annotation (i.e. action localization) while claiming your paper only does action recognition. Did you notice something about them all? They all did their work at Facebook.
Yeah, you just caught an important point in this task. Maybe I misunderstood your code: I thought the "sl" in the config file meant single-label training. So you mean you use multi-label training, right?
That's because the paper covers different datasets: Breakfast is single-label classification, while Charades and MultiThumos are multi-label classification. If you look at the config file of Charades, the flag is 'ml', i.e. multi-label. Also, if you look at the code, you'll find we use sigmoid as the activation of the classification layer.
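Concretely, multi-label ('ml') training puts an independent sigmoid on each class and trains with binary cross-entropy, whereas single-label ('sl') would use softmax + cross-entropy over mutually exclusive classes. A minimal NumPy sketch of the multi-label loss (not the repo's exact implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_bce(logits, targets):
    """Binary cross-entropy over independent sigmoid outputs.

    Sketch of the usual multi-label setup: each class gets its own
    sigmoid, so several classes can be active in one video.
    """
    p = sigmoid(np.asarray(logits, dtype=np.float64))
    t = np.asarray(targets, dtype=np.float64)
    eps = 1e-12  # guard against log(0)
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

# A video carrying classes 0 and 2 at once, which softmax cannot express.
loss = multilabel_bce([2.0, -1.0, 0.5], [1, 0, 1])
```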
Okay, I'm sorry for misunderstanding your code. I just want to make sure: was the I3D baseline model fine-tuned on the Charades dataset with single labels (using the temporal annotation) or with multi-labels?
Again, I fine-tuned neither I3D nor ResNet3D. I got them from here and used them as-is: https://github.com/piergiaj/pytorch-i3d https://github.com/facebookresearch/video-long-term-feature-banks
Okay, it seems the backbone I3D model you got had already been fine-tuned on Charades in their repo. Thanks very much for your clear reply. I will close this issue soon.
Could you please provide the config for the baseline model? I want to reproduce the result of your baseline I3D model. Thank you very much!