guxm2021 / MM_ALT

[MM 2022 Oral] MM-ALT: A Multimodal Automatic Lyric Transcription System
Apache License 2.0

How did you extract features from Av-HuBERT #2

Closed: shakeel608 closed this issue 4 months ago

shakeel608 commented 1 year ago

Do you have any tutorial on how you extracted features from several layers?

guxm2021 commented 1 year ago

Thanks for your interest in our work. I extracted the features from the output of AV-HuBERT. As shown in the figure, "self.modules.wav2vec2" is the AV-HuBERT model. You can then use "video_feats" as the video features.

[Screenshot (2023-02-15): code snippet showing "video_feats" produced by "self.modules.wav2vec2"]

Please check the script in "https://github.com/guxm2021/MM_ALT/blob/main/speechbrain/lobes/models/fairseq_wav2vec.py" to see how we adapt AV-HuBERT in our project.

shakeel608 commented 1 year ago

Thank you for your quick response. What I understood from this code is that it extracts features from the last layer of the trained AV-HuBERT. How can I also extract features from intermediate layers?

In addition, I cannot see the code snippet from the figure in the link you provided.

Could you please kindly help me out?

guxm2021 commented 1 year ago

If you want to extract features from intermediate layers, you will need to write the code yourself in the script speechbrain/lobes/models/fairseq_wav2vec.py (you will likely need to edit the forward function). AV-HuBERT contains a ResNet and a Transformer; you can extract features from the ResNet output or from the outputs of intermediate Transformer blocks (the Transformer contains 24 blocks).

The code snippet in the figure is from lines 27-28 of N20EM/VideoALT/train_avhubert.py. Please let me know if you need further help. If you think our repo is helpful to your own research, please also consider starring or forking our repo and citing our paper. Thanks!
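For readers following along: the general pattern for collecting intermediate layer outputs is to iterate over the blocks inside the forward pass and stash the output of the layers you care about. The sketch below illustrates only that pattern with toy stand-in classes (TinyBlock, TinyEncoder, and the collect_layers argument are all hypothetical names, not part of the MM_ALT repo or fairseq); in the real code you would apply the same idea to the Transformer blocks inside the AV-HuBERT forward function.

```python
class TinyBlock:
    """Stand-in for one Transformer block: here it just adds a constant."""
    def __init__(self, delta):
        self.delta = delta

    def __call__(self, x):
        return [v + self.delta for v in x]


class TinyEncoder:
    """Stand-in for the stack of Transformer blocks inside AV-HuBERT."""
    def __init__(self, num_blocks=4):
        self.blocks = [TinyBlock(i + 1) for i in range(num_blocks)]

    def forward(self, x, collect_layers=()):
        # Run the blocks in order; copy out the outputs of requested layers.
        intermediates = {}
        for idx, block in enumerate(self.blocks):
            x = block(x)
            if idx in collect_layers:
                intermediates[idx] = list(x)
        return x, intermediates


enc = TinyEncoder()
final, feats = enc.forward([0.0], collect_layers=(0, 2))
print(final)  # final-layer output: [10.0]
print(feats)  # intermediate features: {0: [1.0], 2: [6.0]}
```

In a PyTorch model you could achieve the same thing without editing the forward function at all, by attaching forward hooks (module.register_forward_hook) to the desired blocks; editing the forward function, as suggested above, is simply the more explicit route.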