kjy7567 / speech_emotion_recognition_from_log_Mel_spectrogram_using_vertically_long_patch

speech emotion recognition from log mel spectrogram

Question about the size of the log-mel spectrogram of audio #3

Open Changesong opened 3 days ago

Changesong commented 3 days ago

I apologize for asking repeatedly. Section 5-A (Datasets) of the paper describes the data preprocessing: audio is sampled at 16 kHz and limited to 4 seconds, with N=1024, H=64, and a Hamming window of size 512, and the final log-Mel spectrogram has size (128, 128).

To reproduce this, I wrote the following code using librosa.

import librosa
import numpy as np


def audio_path_to_mel_spectrogram(
    file_path: str,
    sr: int = 16000,
    duration: float = 4.0,
    n_fft: int = 1024,
    hop_length: int = 64,
    win_length: int = 512,
    n_mels: int = 128,
) -> np.ndarray:
    """
    Preprocess audio file to generate log-Mel spectrogram image.

    Args:
        file_path (str): Path to the audio file.
        sr (int): Sampling rate (default: 16000 Hz).
        duration (float): Duration to limit the audio (default: 4.0 seconds).
        n_fft (int): Number of FFT components (default: 1024).
        hop_length (int): Number of samples between successive frames (default: 64).
        win_length (int): Window size in samples (default: 512).
        n_mels (int): Number of Mel bands (default: 128).

    Returns:
        np.ndarray: Log-Mel spectrogram, expected shape (128, 128).
    """
    # Load audio file
    y, _ = librosa.load(file_path, sr=sr, duration=duration)

    # Apply zero padding if necessary
    target_length = int(sr * duration)
    if len(y) < target_length:
        y = np.pad(y, (0, target_length - len(y)))
    else:
        y = y[:target_length]

    hop_duration = hop_length / sr
    n_fft_duration = n_fft / sr
    print(f"hop duration is {hop_duration:.5f}, fft_duration is {n_fft_duration:.5f}")

    # Compute log-Mel spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window='hamming',
        n_mels=n_mels
    )

    print(mel_spectrogram.shape)

    # Convert to log scale
    log_mel_spectrogram = librosa.power_to_db(mel_spectrogram)

    return log_mel_spectrogram

When I run this, mel_spectrogram.shape is (128, 1001). For 4 seconds of 16 kHz audio there are 64,000 samples, and dividing by the hop length of 64 gives 1000 frames (plus one from librosa's centered framing), so the number seems reasonable. Should I crop this to (128, 128)? I've processed it that way for now, but I'm wondering whether that's correct, since it would mean using only about 0.5 seconds of audio. Or did I make a mistake somewhere?
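
For reference, here is the frame-count arithmetic as I understand it (this assumes librosa's default center=True padding, so n_frames = 1 + len(y) // hop_length):

# Sanity check of the frame count under librosa's default center=True
sr, duration, hop_length = 16000, 4.0, 64
n_samples = int(sr * duration)           # 64000
n_frames = 1 + n_samples // hop_length   # 1001
print(n_samples, n_frames)               # 64000 1001
print(128 * hop_length / sr)             # 0.512 -> seconds covered by 128 frames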

kjy7567 commented 23 hours ago

Hello. Feel free to ask any questions :) Your suspicion is correct. We set up our experimental conditions following the paper SepTr: Separable Transformer for Audio Spectrogram Processing, which uses a square spectrogram when calculating computational complexity. So we forced the spectrogram image to a square shape via plt.savefig() in all our experiments. There may be a performance difference, but we have verified that our proposed method works better regardless of whether the spectrogram is compressed (interpolated) or not.
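
If you want to compress the array directly instead of going through plt.savefig(), a minimal interpolation sketch of the same idea could look like the following (not our exact pipeline; "some_audio.wav" is just a placeholder):

from scipy.ndimage import zoom

# Sketch: interpolate the (128, 1001) log-Mel spectrogram down to (128, 128)
# along the time axis. We actually went through plt.savefig(), so treat this
# as an approximation of the same idea, not our exact code.
log_mel = audio_path_to_mel_spectrogram("some_audio.wav")  # shape (128, 1001)
compressed = zoom(log_mel, (1.0, 128 / log_mel.shape[1]), order=1)
print(compressed.shape)  # (128, 128)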

If you want to use a spectrogram image of size 128 x 1001 (height, width), you can do so simply by changing img_size:

img_size = (128,1001) # (height, width)

teacher = Teacher(
    image_size = img_size,
    patch_size = patch_size,
    num_classes = num_classes,
    dim = 256,
    depth = 6,
    heads = 5,
    mlp_dim = 256,
    dropout = 0.,
    emb_dropout = 0.,
    channels = 1,
    max_bs = bs
)

You can use teacher_1001_92.0558_CREMA_D.ckpt for 128 x 1001 spectrogram images from CREMA-D.
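
A rough sketch of loading that checkpoint into the teacher defined above (this assumes Teacher is a standard PyTorch module and that the .ckpt stores a plain state dict; unwrap a "state_dict" key first if your checkpoint nests one):

import torch

# Rough loading sketch -- assumes the .ckpt holds a plain state dict;
# unwrap the "state_dict" key first if the checkpoint nests it.
state = torch.load("teacher_1001_92.0558_CREMA_D.ckpt", map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]
teacher.load_state_dict(state)
teacher.eval()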