Hi! I recently faced a similar issue. I had to slightly modify the code so it actually loads these spatial weights.
I think the spatial weights are pre-trained using DINO (the image version of this paper). The URL to download such weights is found here. But they should be downloaded automatically if the proper functions are called within the code. Check this issue for more information.
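In case it's useful, this is roughly the kind of manual loading I mean. It is only a sketch: the checkpoint URL is a placeholder you would fill in with the link above, and the "temporal"/"time_embed" key names I filter on are my guess at how the temporal parameters are named, not necessarily what the repo actually does.

```python
# Sketch of manually initialising spatial weights from a DINO image checkpoint.
# The URL placeholder and the "temporal"/"time_embed" key names are assumptions.
import torch

DINO_CHECKPOINT = "..."  # fill in with the DINO ViT weights URL linked above

def load_dino_spatial_weights(video_model):
    state_dict = torch.hub.load_state_dict_from_url(DINO_CHECKPOINT, map_location="cpu")
    # Some checkpoints nest the weights under "teacher"/"student".
    state_dict = state_dict.get("teacher", state_dict)
    # Drop DDP / backbone prefixes so keys match the video model.
    state_dict = {k.replace("module.", "").replace("backbone.", ""): v
                  for k, v in state_dict.items()}
    # Keep only spatial weights; temporal attention stays randomly initialised.
    state_dict = {k: v for k, v in state_dict.items()
                  if "temporal" not in k and "time_embed" not in k}
    msg = video_model.load_state_dict(state_dict, strict=False)
    print(f"Loaded spatial weights: {msg}")
    return video_model
```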
I hope it helps.
Thanks for the answer and pointing that out @javierselva - we are working on making this fix!
@javierselva Hello, thank you for your answer; it has been very helpful to me. Can I ask you a few more questions? How is the sampling done during finetuning, and what should the sampling_rate be? What should self.cfg.DATA.NUM_FRAMES be, 64 or 8? Are fast_frames selected first and then slow_frames selected from them? Should NUM_ENSEMBLE_VIEWS be 10 or 1? The paper mentions: "We use two clips per video sampled at different spatiotemporal resolutions (T, W, H) ∈ {(8, 224, 224), (64, 96, 96)} with 3 spatial crops each for testing (6 clips in total)." So it should be 1, but in the code it is 10, so I'm a bit confused. Thank you for your answer.
Hi @kffeng, I'm afraid I cannot help you much with this. When I reproduced the experiments, I kept all these parameters at their default values. But here are my beliefs; maybe @kahnchana can correct me.
During training, a video is selected from the dataset and two clips are sampled: one with 64 frames at a smaller spatial resolution (96x96), and another with just 8 frames at a larger spatial resolution (224x224). Then, NUM_ENSEMBLE_VIEWS is used to define the number of different views that will be used for the alignment task. If I recall correctly, they train the networks so that the representation of one view is predictive of the other views. This is done through data augmentation, resulting in a total of 10 views: two of them cropped and augmented from the 64x96² clip (coined as global views) and eight from the 8x224² clip (coined as local views).
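Just to make the idea concrete, here is a purely illustrative sketch of that sampling. The function names and the noise "augmentation" are stand-ins I made up; the real sampling and augmentation logic lives in the dataset code.

```python
# Illustrative only: two clips at different spatiotemporal resolutions,
# then several differently "augmented" views of each for the alignment task.
import torch
import torch.nn.functional as F

def sample_clip(frames, num_frames, size):
    """frames: (T, C, H, W) float tensor. Uniformly pick num_frames and resize."""
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    clip = frames[idx]                                    # (num_frames, C, H, W)
    return F.interpolate(clip, size=(size, size),
                         mode="bilinear", align_corners=False)

def build_views(frames):
    """Return the 10 views described above: 2 + 8 augmented crops."""
    clip_64 = sample_clip(frames, num_frames=64, size=96)   # (64, C, 96, 96)
    clip_8 = sample_clip(frames, num_frames=8, size=224)    # (8, C, 224, 224)
    # Stand-in "augmentation": a bit of noise so each view differs.
    views = [clip_64 + 0.01 * torch.randn_like(clip_64) for _ in range(2)]
    views += [clip_8 + 0.01 * torch.randn_like(clip_8) for _ in range(8)]
    return views

# e.g. a fake decoded video of 300 RGB frames at 256x340
views = build_views(torch.rand(300, 3, 256, 340))
print([tuple(v.shape) for v in views])   # 10 views in total
```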
You can further check whether this is correct by going back to the paper and to the Dataset classes used, in particular the Kinetics one, as it is the one used during training. Take a look at what is returned by the calls to the decoder (video loading), and at what the dataloader ends up returning after producing all the views (the images variable).
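A quick way to do that check is to grab one batch and print the shapes. This assumes you already have the training DataLoader built, and the position of the images variable in the batch tuple is just my guess.

```python
# Grab one batch from the (already built) training DataLoader and inspect it.
batch = next(iter(loader))
images = batch[0]              # where `images` sits in the batch is a guess
if isinstance(images, (list, tuple)):
    for i, view in enumerate(images):
        print(f"view {i}: {tuple(view.shape)}")   # expect 10 entries for 10 views
else:
    print(tuple(images.shape))
```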
I hope this helps, and again, I may be wrong 😅
@javierselva, thank you for your reply. I will take a closer look at the content of the paper.
Hi, dear team. Thank you for your excellent work. I am a beginner and would like to ask about the three places in your code where pre-trained weights are loaded:
1. In timesformer.py, inside class vit_base_patch16_224(nn.Module): if self.pretrained: load_pretrained(self.model, …)
2. In train_ssl.py: if args.pretrained_rgb is not None: state_dict = torch.load(args.pretrained_rgb)["teacher"] …
3. Also in train_ssl.py: msg = teacher_without_ddp.load_state_dict(student.module.state_dict(), strict=False); print(f"initialized teacher with student msg: {msg}")
I would like to know what the differences between them are. Also, your paper states: "We randomly initialize weights relevant to temporal attention while spatial attention weights are initialized using a ViT model trained in a self-supervised manner over the ImageNet-1k dataset."
So which of these three places is the one loading the spatial attention weights? Finally, it seems that you have not provided these spatial attention weights, so I would also like to ask which paper's self-supervised ViT weights you are loading. I look forward to your reply, and once again, thank you very much.