ixobert / birds-generation


Spectrogram to audio conversion #15

Open masonyoungblood opened 2 months ago

masonyoungblood commented 2 months ago

I know that spectrogram to audio conversion has been brought up in several other issues (e.g., #3 and #12), but I don't think it has been fully addressed yet. The spectrogram output from interactive_app.py looks excellent, but the audio output sounds wrong and has a much shorter duration than the input audio. I've been troubleshooting spectrogram to audio conversion with a modified form of the code from interactive_app.py, and I'm getting the best results with...

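import cv2
import librosa
import soundfile

def spectrogram_to_audio(spec, out, sr=16384, n_fft=1024, dur=4):  # n_fft is currently unused
    spec = cv2.resize(spec, (129, 128))  # dsize is (width, height): 129 frames x 128 mel bands
    spec = librosa.db_to_power(spec)  # convert dB values back to power
    audio = librosa.feature.inverse.mel_to_audio(spec, sr=sr, power=2, n_iter=32, length=sr * dur)  # Griffin-Lim
    soundfile.write(out, audio, sr)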

The output audio is the correct duration, but is extremely poor quality (beyond what is expected from spectrogram to audio conversion). I think it may be that the FFT argument is not being specified, but if I try to set n_fft in mel_to_audio it throws errors, unless I remove length = sr*dur, in which case the duration of the audio is wrong.

This leads me to think that there is something atypical about the spectrogram files from generate_samples.py that makes inversion difficult to perform. Do you have any idea what the issue might be?

ixobert commented 2 months ago

@masonyoungblood, the implementation aims to produce augmented spectrograms for classifier training, especially useful when sample sizes are small. However, audibility varies depending on the input. For instance, interpolating between two significantly different audio files (like varying pitches or decibel levels) can affect the outcome.

Here is the catch: spectrograms are a lossy representation of the input waveform, mainly because they discard the signal's phase information, which makes it challenging to reconstruct the audio accurately. The Griffin-Lim method tries to estimate this missing phase, but it's not perfect. This can lead to poor audio quality, particularly if the FFT settings and durations aren't well-matched to the specifics of the generated spectrograms. Adjusting these parameters may help, but some degradation in audio quality is often unavoidable due to these inherent limitations.
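To make "well-matched" concrete, here is a rough sketch of the arithmetic that ties FFT settings to duration. It assumes the spectrogram is treated as a centered STFT (frame count = 1 + samples // hop_length) and that librosa falls back to hop_length = n_fft // 4 when hop_length is not passed. The helper names are hypothetical, and whether the generated spectrograms were actually produced with these settings is not confirmed in this thread.

```python
def default_hop_length(n_fft):
    """Hop length librosa falls back to when hop_length is omitted (n_fft // 4)."""
    return n_fft // 4


def frames_for(n_samples, hop_length):
    """Frame count of a centered STFT over n_samples samples."""
    return 1 + n_samples // hop_length


def hop_for(n_samples, n_frames):
    """Hop length at which n_frames centered frames span n_samples samples."""
    return n_samples // (n_frames - 1)


# Values taken from the function posted earlier in this thread.
sr, dur, n_frames = 16384, 4, 129
n_samples = sr * dur  # what length=sr*dur requests

for n_fft in (2048, 1024):
    hop = default_hop_length(n_fft)
    implied_dur = (n_frames - 1) * hop / sr  # seconds covered by 129 frames
    print(f"n_fft={n_fft}: hop={hop}, frames needed={frames_for(n_samples, hop)}, "
          f"129 frames cover {implied_dur} s")

print("hop that fits 129 frames into 4 s:", hop_for(n_samples, n_frames))
```

Under these assumptions, 129 frames span exactly 4 s only at a hop length of 512, which is what the default n_fft of 2048 implies. Setting n_fft = 1024 without also passing hop_length halves the hop to 256, so the same 129 frames cover about 2 s and no longer agree with length = sr*dur. Passing hop_length = 512 alongside n_fft = 1024 might keep the two consistent, though it would not remove the degradation that comes from estimating phase.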

I have seen some work attempting to reconstruct waveform signals from spectrograms, and it claims promising results.