pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Generating good quality spectrograms for analysis using torchaudio #1023

Closed ghost closed 3 years ago

ghost commented 4 years ago

Hi,

I am highly interested in generating spectrograms on the GPU. However, the spectrograms generated using torchaudio do not seem to be of as good quality as the spectrograms generated by Librosa, and Librosa doesn't run on the GPU.

To make this clearer, please find the attached spectrograms.

For torchaudio: [attached spectrogram image generated with torchaudio]

For Librosa: [attached spectrogram image, acrocephalus-arundinaceus-178789.mp3]

cpuhrsch commented 4 years ago

Hello @romanshrestha17-iv,

Thank you for posting this issue and moving the conversation here! Is the file you're using to generate these images public?

Thanks, Christian

cc @mthrok @vincentqb

ghost commented 4 years ago

Hi @cpuhrsch ,

Thank you for your prompt response. Yes, I generated spectrograms for some of the files from the Kaggle dataset just to test how the spectrograms would look.

We are moving on to a confidential dataset soon, but if torchaudio is able to generate spectrograms that look as good as Librosa's, that would be great. Please find the link to the dataset below.

https://www.kaggle.com/monogenea/birdsongs-from-europe

vincentqb commented 4 years ago

Thanks for opening the issue! I'm getting a very similar MelSpectrogram here with that dataset; do you have code so we can reproduce?

keunwoochoi commented 4 years ago

@romanshrestha17-iv If it's for manual inspection, I'd say you should just stick with librosa, which has many default values set up nicely. Otherwise, check out options like i) using a mel-spectrogram instead of the STFT, ii) using decibel scaling, iii) clamping the input of the decibel scaling (check out how librosa does it), or similarly, doing linear_to_decibel(1 + abs(melspectrogram)).
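
A rough sketch of these suggestions with torchaudio might look like the following (the file name and parameter values are illustrative, not from this thread):

import torch
import torchaudio

# load a waveform; torchaudio.load returns (waveform, sample_rate)
waveform, sr = torchaudio.load("bird_song.wav")

# i) mel-spectrogram instead of a plain STFT
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=512, n_mels=128)
mel = mel_transform(waveform)

# ii) + iii) decibel scaling with the input clamped away from zero,
# similar in spirit to librosa's power_to_db with a small floor
mel_db = 10.0 * torch.log10(torch.clamp(mel, min=1e-9))

# alternatively, the log(1 + x) variant mentioned above
mel_log1p = torch.log1p(mel)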

ghost commented 4 years ago

@romanshrestha17-iv If it's for manual inspection, I'd say you should just stick with librosa, which has many default values set up nicely. Otherwise, check out options like i) using a mel-spectrogram instead of the STFT, ii) using decibel scaling, iii) clamping the input of the decibel scaling (check out how librosa does it), or similarly, doing linear_to_decibel(1 + abs(melspectrogram)).

@keunwoochoi Thanks for the suggestions. I've tried all of them out and there is still not much difference in the torchaudio spectrograms. I'm using librosa at the moment, but I'm particularly interested in torchaudio because GPUs could help accelerate the spectrogram generation process.
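
For the GPU part, MelSpectrogram is an nn.Module, so the transform and the waveform can both be moved to the device; a minimal sketch, assuming a CUDA device and illustrative parameters:

import torch
import torchaudio

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

waveform, sr = torchaudio.load("bird_song.wav")
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=512, n_mels=128).to(device)

# the computation runs on the GPU when one is available
mel = mel_transform(waveform.to(device))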

ghost commented 4 years ago

Thanks for opening the issue! I'm getting a very similar MelSpectrogram here with that dataset; do you have code so we can reproduce?

Thank you for investigating this issue, Vincent. A snippet of the code used to generate the log-mel spectrogram with librosa is here:

import librosa as lr
import numpy as np
import skimage.io

def scale_minmax(X, min=0.0, max=1.0):
    # rescale X linearly into the [min, max] range
    X_std = (X - X.min()) / (X.max() - X.min())
    X_scaled = X_std * (max - min) + min
    return X_scaled

def spectrogram_image(y, sr, out, hop_length, n_mels):
    # use log-melspectrogram
    mels = lr.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                     n_fft=hop_length*2, hop_length=hop_length)
    mels = np.log(mels + 1e-9) # add small number to avoid log(0)

    # min-max scale to fit inside 8-bit range
    img = scale_minmax(mels, 0, 255).astype(np.uint8)
    img = np.flip(img, axis=0) # put low frequencies at the bottom in image
    img = 255-img # invert. make black==more energy

    # save as PNG
    skimage.io.imsave(out, img)

--------------------------------------------------

if __name__ == '__main__':
    # settings
    hop_length = 512 # number of samples per time-step in spectrogram
    n_mels = 128 # number of bins in spectrogram. Height of image
    time_steps = 384 # number of time-steps. Width of image

    # audio_f is the list of paths to the audio files
    files = len(audio_f)
    spk_ID = [audio_f[i].split('/')[-1].lower() for i in range(files)]

    for i in range(files):
        # load audio
        y, sr = lr.load(audio_f[i])
        out = "{}.png".format(spk_ID[i])

        # extract a fixed-length window
        start_sample = 0 # starting at beginning
        length_samples = time_steps*hop_length
        window = y[start_sample:start_sample+length_samples]

        # convert to PNG
        spectrogram_image(window, sr=sr, out=out, hop_length=hop_length, n_mels=n_mels)

    print('done!')
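
For reference, a torchaudio counterpart of the snippet above might look roughly like this (a sketch mirroring the librosa parameters; the function name and file handling are illustrative, not code from this thread):

import numpy as np
import skimage.io
import torch
import torchaudio

def torchaudio_spectrogram_image(path, out, hop_length=512, n_mels=128, time_steps=384):
    # load audio and take a fixed-length window, as in the librosa version
    waveform, sr = torchaudio.load(path)
    waveform = waveform[:, :time_steps * hop_length]

    # log-melspectrogram with the same n_fft/hop_length/n_mels settings
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=hop_length * 2, hop_length=hop_length, n_mels=n_mels)
    mels = torch.log(mel_transform(waveform)[0] + 1e-9).numpy()

    # min-max scale to 8-bit, flip so low frequencies sit at the bottom, invert
    img = (255 * (mels - mels.min()) / (mels.max() - mels.min())).astype(np.uint8)
    img = 255 - np.flip(img, axis=0)
    skimage.io.imsave(out, img)
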
cpuhrsch commented 4 years ago

@romanshrestha17-iv Could you also please post the code you used to generate the torchaudio image?

ghost commented 4 years ago

@cpuhrsch the code used to generate the torchaudio image is similar to what @vincentqb has shared in his notebook here

mthrok commented 3 years ago

We have updated the tutorial with how to generate a MelSpectrogram that is numerically comparable with librosa. Please check out the new tutorial: https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html#id1

Up to release 0.8, torchaudio could only generate the equivalent of librosa's htk=True spectrograms, but we have recently added support for htk=False as well. We will follow up on this in the next release.
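
A librosa-comparable configuration along those lines might look like the following (parameter values are illustrative, and the norm/mel_scale arguments assume a torchaudio build that already includes the new option):

import torchaudio

sample_rate = 22050  # librosa's default resampling rate

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=512,
    n_mels=128,
    norm="slaney",       # librosa's default mel filterbank normalization
    mel_scale="slaney",  # corresponds to librosa's htk=False
)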