Changesong opened 3 days ago
Hello. Feel free to ask any questions :) Your suspicion is right. We used the following paper to set up our experimental conditions: SepTr: Separable Transformer for Audio Spectrogram Processing. That paper uses square spectrograms when calculating computational complexity, so we forced the spectrogram image to a square shape via `plt.savefig()` in all our experiments. There may be a performance difference, but we verified that our proposed method works better whether or not the spectrogram is compressed (interpolated).
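A minimal sketch of forcing a spectrogram to a square image with `plt.savefig()`, as described above (the random array, file name, and exact figure size are my own illustrative assumptions, not the authors' code):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

spec = np.random.rand(128, 1001)  # stand-in for a (mel bins, frames) spectrogram

fig = plt.figure(figsize=(1.28, 1.28), dpi=100)  # 1.28 in * 100 dpi = 128 px canvas
ax = fig.add_axes([0, 0, 1, 1])  # axes fill the figure: no margins, no ticks
ax.axis("off")
ax.imshow(spec, aspect="auto", origin="lower")  # aspect="auto" squashes 1001 frames into the square
fig.savefig("spec_square.png", dpi=100)
plt.close(fig)
```

Note that this resamples the 1001 time frames down to 128 pixels at render time, which is the interpolation the comment above refers to.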
If you want to use a spectrogram image of size 128 × 1001 (height, width), you can do so by simply changing `img_size`:
```python
img_size = (128, 1001)  # (height, width)

teacher = Teacher(
    image_size  = img_size,
    patch_size  = patch_size,
    num_classes = num_classes,
    dim         = 256,
    depth       = 6,
    heads       = 5,
    mlp_dim     = 256,
    dropout     = 0.,
    emb_dropout = 0.,
    channels    = 1,
    max_bs      = bs,
)
```
You can use `teacher_1001_92.0558_CREMA_D.ckpt` for 128 × 1001 spectrogram images from CREMA-D.
I apologize for asking repeatedly. Section 5-A (Datasets) of the paper describes the data preprocessing: audio sampled at 16 kHz, 4 seconds long; N = 1024, H = 64, window size 512 with a Hamming window; and a final log-mel spectrogram of size (128, 128).
To reproduce this, I wrote the following code using librosa.
When I did this, `mel_spectrogram.shape` was (128, 1001). For 4 seconds of 16 kHz audio the total is 64,000 samples, and dividing by the hop length of 64 gives 1000, so the numbers seem reasonable. Should I crop this to (128, 128)? I've processed it that way for now, but I'm wondering whether that's correct, since it would mean using only about 0.5 seconds of audio. Or did I make a mistake somewhere?
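The arithmetic above can be checked directly (assuming librosa's default `center=True`, which yields `1 + n_samples // hop` frames):

```python
sr = 16000       # sampling rate (Hz)
hop = 64         # hop length H from the paper
n_samples = sr * 4            # 4-second clip: 64,000 samples
n_frames = 1 + n_samples // hop
print(n_frames)               # 1001

# Keeping only the first 128 frames covers 128 * hop = 8192 samples,
# i.e. about half a second of the 4-second clip.
print(128 * hop / sr)         # 0.512
```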