How to get the audio image?

Thanks for your outstanding work. I'm just getting started with speech signal processing and I have a question. In the README example, the input is a 1024*128 image. How should we obtain this image? Normally we process audio with librosa or torchaudio, which gives us a matrix. Can I use this matrix directly? librosa renders mel-spectrogram features as an image with a few extra steps, such as:

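import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# audio_mel: mel power spectrogram (e.g. from librosa.feature.melspectrogram); sr: its sample rate
fig, ax = plt.subplots()
S_dB = librosa.power_to_db(audio_mel, ref=np.max)  # power -> dB relative to the peak
img = librosa.display.specshow(S_dB, x_axis='time',
                               y_axis='mel', sr=sr,
                               fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram')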

So I'm wondering: is the model input the rendered image from the plot, or the mel-spectrogram matrix itself?
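To make the question concrete, here is a minimal sketch of what "using the matrix directly" might look like. It assumes (not confirmed by the README) that the model takes the mel-spectrogram matrix rather than the rendered figure, and that 1024 is the number of time frames and 128 the number of mel bins. The helper name `to_fixed_size` and the choice to pad with the quietest value are hypothetical, and a random array stands in for a real dB-scaled matrix such as the `S_dB` above:

```python
import numpy as np

def to_fixed_size(mel_db, target_frames=1024, pad_value=None):
    """Turn a (n_mels, n_frames) matrix into a (target_frames, n_mels) array.

    Hypothetical helper: pads short clips at the end and crops long ones.
    """
    if pad_value is None:
        # Assumption: pad with the quietest value present in the clip.
        pad_value = float(mel_db.min())
    x = mel_db.T  # time-major layout: (n_frames, n_mels)
    n_frames, n_mels = x.shape
    if n_frames < target_frames:
        pad = np.full((target_frames - n_frames, n_mels), pad_value, dtype=x.dtype)
        x = np.concatenate([x, pad], axis=0)
    else:
        x = x[:target_frames]
    return x

# Stand-in for a real dB-scaled mel spectrogram with 128 mel bins and 700 frames.
rng = np.random.default_rng(0)
fake_mel_db = rng.uniform(-80.0, 0.0, size=(128, 700)).astype(np.float32)

model_input = to_fixed_size(fake_mel_db)
print(model_input.shape)  # (1024, 128)
```

If this reading is right, the matplotlib figure would only be for inspection, and the array above (plus whatever normalization the repo expects) would be what gets fed to the model.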

rishikksh20 / AudioMAE-pytorch
