rishikksh20 / AudioMAE-pytorch

Unofficial PyTorch implementation of Masked Autoencoders that Listen

How to get the audio image? #1

Ian-Tam opened this issue 2 years ago

Ian-Tam commented 2 years ago

Thanks for your outstanding work. I'm just getting started with speech signal processing and I have a question. In the example in the README, the input is a 1024×128 image; how should we obtain this image? In general, we use Librosa or TorchAudio to process the audio and end up with a matrix. Can I use this matrix directly? Librosa visualizes the mel-spectrogram features as an image with a few more operations, such as:

import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display

# audio_mel: mel-spectrogram matrix (e.g. from librosa.feature.melspectrogram); sr: sample rate
fig, ax = plt.subplots()
S_dB = librosa.power_to_db(audio_mel, ref=np.max)   # convert power to dB scale
img = librosa.display.specshow(S_dB, x_axis='time',
                               y_axis='mel', sr=sr,
                               fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram')

So I'm wondering if you're using the visualized image as input or the matrix of the audio Mel-spec?

rishikksh20 commented 2 years ago

Convert the audio to a mel-spectrogram with 128 mel bins; you can then treat that mel-spectrogram matrix itself as the image.
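For reference, here is a minimal sketch of one way to build a 1024×128 log-mel input with torchaudio. It is not necessarily this repo's exact preprocessing; the window/hop/padding choices and the file name example.wav are assumptions for illustration.

import torch
import torch.nn.functional as F
import torchaudio

# Load a clip (roughly 10 s works out to about 1000 frames with a 10 ms hop).
waveform, sr = torchaudio.load("example.wav")   # shape: (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)   # mix down to mono

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,
    hop_length=sr // 100,   # 10 ms hop, an assumed value
    n_mels=128,             # 128 mel bins, as suggested above
)(waveform)                 # shape: (1, 128, time_frames)

log_mel = torch.log(mel + 1e-6)               # log-compress the power spectrogram
log_mel = log_mel.squeeze(0).transpose(0, 1)  # -> (time_frames, 128)

# Pad or crop along time so the model sees a fixed 1024 x 128 input.
target_frames = 1024
if log_mel.size(0) < target_frames:
    log_mel = F.pad(log_mel, (0, 0, 0, target_frames - log_mel.size(0)))
else:
    log_mel = log_mel[:target_frames]

print(log_mel.shape)  # torch.Size([1024, 128])

The point is that this matrix is fed to the model directly; the matplotlib/specshow rendering above is only for visualization.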