Closed: shakeel608 closed this issue 4 months ago
Thanks for your interest in our work. I extracted the features after the AV-HuBERT model. As shown in the figure, "self.modules.wav2vec2" is the AV-HuBERT model, and you can use "video_feats" as the video features.
Please check the script in "https://github.com/guxm2021/MM_ALT/blob/main/speechbrain/lobes/models/fairseq_wav2vec.py" to see how we adapt AV-HuBERT in our project.
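For reference, here is a minimal sketch of what that usage looks like, assuming a SpeechBrain-style setup; the class name `VideoALTSketch` and the plain tensor input are illustrative assumptions, not the actual code from `train_avhubert.py`:

```python
import torch

class VideoALTSketch(torch.nn.Module):
    """Illustrative wrapper: `wav2vec2` stands in for the adapted AV-HuBERT
    module from speechbrain/lobes/models/fairseq_wav2vec.py."""

    def __init__(self, avhubert_module: torch.nn.Module):
        super().__init__()
        self.wav2vec2 = avhubert_module  # pretrained AV-HuBERT wrapper

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Running the video through AV-HuBERT yields the frame-level
        # features that the thread refers to as `video_feats`.
        video_feats = self.wav2vec2(video)
        return video_feats
```

In the actual training script the call happens inside the Brain class's forward computation, which is why it appears as `self.modules.wav2vec2(...)` in the figure.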
Thank you for your quick response. From what I understand, this code extracts features from the last layer of the trained AV-HuBERT. How can I also extract features from intermediate layers?
In addition, I cannot see the code snippet from the figure at the link you provided.
Could you please kindly help me out?
If you want to extract features from intermediate layers, please write the code yourself in speechbrain/lobes/models/fairseq_wav2vec.py (I think you may need to edit the forward function). AV-HuBERT contains a ResNet front-end and a Transformer encoder with 24 blocks, so you can extract features from the ResNet output or from the output of an intermediate Transformer block (a sketch follows below). The code snippet in the figure at the link is from lines 27-28 in N20EM/VideoALT/train_avhubert.py. Please let me know if you need further help. If you find our repo helpful to your own research, please also consider starring or forking the repo and citing our paper. Thanks!
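Not the repo's actual code, but a minimal sketch of the kind of edit meant here, assuming the wrapped model follows fairseq's standard HuBERT interface, where `extract_features(..., output_layer=k)` stops after the k-th Transformer block (for AV-HuBERT the input is typically a dict of audio/video tensors and the exact signature may differ); the class and argument names are illustrative:

```python
import torch

class FairseqAVHubertWrapper(torch.nn.Module):
    """Illustrative variant of the wrapper in fairseq_wav2vec.py whose
    forward can return intermediate-layer features instead of only the
    last layer's output."""

    def __init__(self, avhubert_model: torch.nn.Module):
        super().__init__()
        self.model = avhubert_model  # loaded fairseq AV-HuBERT model

    def forward(self, source, padding_mask=None, output_layer=None):
        # In fairseq's HuBERT API, output_layer is 1-based: output_layer=12
        # returns the output of the 12th of the 24 Transformer blocks, and
        # output_layer=None keeps the default last-layer behaviour.
        feats, _ = self.model.extract_features(
            source,
            padding_mask=padding_mask,
            mask=False,
            output_layer=output_layer,
        )
        return feats
```

With such a wrapper, `wrapper(video, output_layer=12)` would give mid-encoder features. For the ResNet front-end output, fairseq's `extract_features` also has a `ret_conv=True` path that returns the features computed before the Transformer, though you should check how the adapted AV-HuBERT exposes it.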
Do you have a tutorial on how you extracted features from several layers?