Thanks for your outstanding work.
I'm just getting started with speech signal processing and I have a question. The example in the README takes a 1024*128 image as input. How should we obtain this image? We usually process audio with Librosa or TorchAudio, which gives us a matrix. Can I use this matrix directly?
Librosa visualizes the mel-spectrogram features as an image with a few more operations, such as:
So I'm wondering whether you use the visualized image as input, or the mel-spectrogram matrix of the audio itself?
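To make the question concrete, here is a rough sketch of what I mean by "using the matrix directly". It assumes the matrix comes out as (n_mels, T), which is the layout Librosa and TorchAudio return, and that the model wants (1024, 128), i.e. time frames first. The helper name `fit_to_frames` and the padding value of 0.0 are my own assumptions, not something taken from this repo, and I have left out any normalization the model may expect.

```python
from typing import List

Matrix = List[List[float]]


def fit_to_frames(mel: Matrix, target_frames: int = 1024,
                  pad_value: float = 0.0) -> Matrix:
    """Reshape an (n_mels, T) matrix into (target_frames, n_mels).

    Hypothetical helper: the pad value and the truncation policy are
    assumptions, not the repository's actual preprocessing.
    """
    n_mels = len(mel)
    # Transpose (n_mels, T) -> (T, n_mels) so that time is the first axis.
    frames = [list(column) for column in zip(*mel)]
    # Drop frames beyond the target length.
    frames = frames[:target_frames]
    # Pad with constant frames if the clip is shorter than the target.
    missing = target_frames - len(frames)
    frames += [[pad_value] * n_mels for _ in range(missing)]
    return frames


# Dummy stand-in for a real mel-spectrogram: 128 mel bins, 300 time frames.
dummy = [[float(m + t) for t in range(300)] for m in range(128)]
fitted = fit_to_frames(dummy)
print(len(fitted), len(fitted[0]))
```

If this is roughly the expected input, then the plotted image would not be needed at all, since plotting only maps the same values to colors.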