microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

More details about the output #6

Closed AhmedHashish123 closed 2 years ago

AhmedHashish123 commented 2 years ago

When I run the example in the UniSpeech-SAT directory of this repo, I get 'f' as a tensor of size torch.Size([1, 512, 31]). What exactly does the variable f represent?

Sanyuan-Chen commented 2 years ago

Hi @AhmedHashish123 ,

Thank you for your interest and the question! We accidentally used the wrong feature extraction function in the example usage for loading pretrained models, and we have fixed it in the README file. If you go through the updated example usage, the "f" tensor it returns should have size torch.Size([1, 31, 768]), where 1 is the batch size, 31 is the number of time steps, and 768 is the hidden state dimension. Since the window size and stride of the CNN feature extractor are 400 and 320 samples respectively, a 10000-sample input gives (10000 - 80) / 320 = 31 time steps, where 80 = 400 - 320.
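For anyone who wants to check the time-step arithmetic, here is a minimal sketch in plain Python. The first function only uses the window (400) and stride (320) quoted above. The per-layer (kernel, stride) list in the second function is an assumption borrowed from the default wav2vec 2.0 / HuBERT-style convolutional feature extractor; it is not stated in this thread, and the function names are hypothetical.

```python
def frames_closed_form(num_samples, window=400, stride=320):
    """Number of frames from a sliding window with no padding."""
    return (num_samples - window) // stride + 1


# ASSUMPTION: (kernel, stride) per conv layer, taken from the default
# wav2vec 2.0 / HuBERT-style feature extractor, not from this thread.
ASSUMED_CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2


def frames_layerwise(num_samples, layers=ASSUMED_CONV_LAYERS):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


num_samples = 10000  # input length used in the calculation above
print(frames_closed_form(num_samples))  # 31
print(frames_layerwise(num_samples))    # 31
```

Both routes give 31 for a 10000-sample input, which matches the middle dimension of torch.Size([1, 31, 768]). The 512 in the originally reported shape would be consistent with the raw CNN output channels rather than the 768-dimensional hidden states, though that reading is an inference and not something confirmed above.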

AhmedHashish123 commented 2 years ago

Thank you for making that clear.