lucasnewman / best-rq-pytorch

Implementation of BEST-RQ - a model for self-supervised learning of speech signals using a random projection quantizer, in Pytorch.
MIT License

Missing Convolution Subsampling? #1

Open fmac2000 opened 9 months ago

fmac2000 commented 9 months ago

Hi Lucas, I'm looking over the code and I believe the two convolution subsampling layers are missing from conformer.py. From the paper:

4.1.1. NON-STREAMING MODELS

The model has two convolution layers at the bottom which provide 4 times temporal-dimension reduction for the input sequences. The rest of the layers are a stack of Conformer models. We explore 0.6B model size which is extensively studied in the previous works. The model contains 24 layers of Conformer models.
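
For reference, here is a minimal sketch of what such a subsampling front end could look like in PyTorch. The quoted passage only says there are two convolution layers giving a 4x temporal reduction, so everything else here is an assumption: the module name `ConvSubsampling`, the use of 2D convolutions (which also reduce the frequency axis), the kernel size, channel count, ReLU activations, and the final linear projection. None of it is taken from this repository.

```python
import torch
from torch import nn


class ConvSubsampling(nn.Module):
    """Two stride-2 convolutions giving roughly 4x reduction along time.

    Hypothetical module: name, kernel size, channel count, and activation
    are illustrative assumptions, not the paper's or the repo's settings.
    """

    def __init__(self, n_mels: int, dim: int, channels: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # With kernel 3, stride 2, padding 1, each conv maps a length n
        # to (n - 1) // 2 + 1, i.e. ceil(n / 2). Apply that twice to the
        # frequency axis to size the projection layer.
        freq = n_mels
        for _ in range(2):
            freq = (freq - 1) // 2 + 1
        self.proj = nn.Linear(channels * freq, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mels)
        x = x.unsqueeze(1)               # (batch, 1, time, n_mels)
        x = self.conv(x)                 # (batch, channels, time', n_mels')
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)
        return self.proj(x)              # (batch, time', dim)


subsample = ConvSubsampling(n_mels=80, dim=64)
features = torch.randn(2, 100, 80)       # 100 input frames
print(subsample(features).shape)         # torch.Size([2, 25, 64])
```

The output of a module like this would then feed the stack of Conformer layers, so the model runs on a sequence a quarter of the original length.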


If you'd like, I can create a pull request and implement this for you now. Thanks! If I've misunderstood the paper, please call me out! 😅

lucasnewman commented 9 months ago

@fmac2000 Yes, please, a PR would be great! I was aware of them in the paper, but I skipped them for simplicity and downsampled in the feature extractor to get it off the ground. I would love to have it reflect the paper as closely as possible, though!