Closed weixsong closed 6 years ago
Hi @weixsong, that's a really good question!
For context, in case others are reading this, the way this is called in melspectrogram is:
def melspectrogram(y):
    D = _stft(preemphasis(y))
    S = _amp_to_db(_linear_to_mel(np.abs(D)))
    return _normalize(S)
So _normalize is called on S which is a spectrogram with mel-scaled frequency buckets and decibel values.
The thinking is that the magnitudes in the STFT should always be between 0 and 1, so converting them to decibels should always give a negative number (or zero).
The normalization yields a nice distribution of values between 0 and 1 to predict. But I'm not sure if this is actually necessary. It's possible that the model would learn to predict the unnormalized values just fine too.
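To make that reasoning concrete, here is a minimal sketch of the two steps on single values. The bodies of _amp_to_db and _normalize are not shown in this thread, so the 20*log10 convention, the 1e-5 floor, and the -100 dB minimum level below are all assumptions chosen for illustration, not the repository's confirmed code.

```python
import math

MIN_LEVEL_DB = -100.0  # assumed floor; not given in this thread


def amp_to_db(x):
    """Convert a linear amplitude to decibels (assumed 20*log10 convention)."""
    return 20.0 * math.log10(max(x, 1e-5))  # floor avoids log10(0)


def normalize(s_db):
    """Map [MIN_LEVEL_DB, 0] dB linearly onto [0, 1], clipping anything outside."""
    scaled = (s_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB
    return min(max(scaled, 0.0), 1.0)


# Amplitudes in (0, 1] give non-positive decibels, which land in [0, 1].
for amplitude in (1e-5, 0.01, 0.5, 1.0):
    db = amp_to_db(amplitude)
    print(f"amplitude={amplitude:g}  dB={db:.2f}  normalized={normalize(db):.3f}")
```

Under these assumptions an amplitude of 1.0 maps to 0 dB and a normalized value of 1, while anything at or below the floor maps to 0. On whole spectrograms the same arithmetic would typically be written with np.log10, np.maximum and np.clip.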
Hi @keithito, thanks for your reply. But I still have a question. I'm using the following code to check your answer:
import librosa
import numpy as np

y, sr = librosa.load('t1_100000.wav', sr=None)
spectrum = librosa.stft(y=y, n_fft=2048, hop_length=200, win_length=800)
D = np.abs(spectrum)  # magnitude spectrogram
mel_basis = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80)
mel_spectrum = np.dot(mel_basis, D)
>>> np.amax(mel_spectrum)
5.7544252176933997
>>> np.log10(5.7544252176933997)
0.76000195051132646
It seems the input to _amp_to_db() is not limited to [0, 1], so its log is not always negative. Am I doing something wrong?
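One way to see why magnitudes above 1 can be expected is to compute a single DFT bin by hand for a full-scale sine. This is only a sketch: it does not call librosa.stft, and it assumes an unnormalized transform with a periodic Hann window of length 800 (the win_length used above). The bin index K is an arbitrary choice for illustration.

```python
import cmath
import math

N = 800  # window length, matching win_length in the snippet above
K = 40   # analysis bin; the sine completes exactly K cycles in N samples

# Periodic Hann window and a sine whose samples stay within [-1, 1].
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
signal = [math.sin(2 * math.pi * K * n / N) for n in range(N)]

# One DFT bin of the windowed frame: sum_n w[n] * x[n] * exp(-2j*pi*K*n/N).
bin_value = sum(
    w * x * cmath.exp(-2j * math.pi * K * n / N)
    for n, (w, x) in enumerate(zip(window, signal))
)

print(max(abs(x) for x in signal))  # largest sample magnitude, about 1.0
print(abs(bin_value))               # about sum(window) / 2, far above 1
```

With these settings the bin magnitude comes out near 200, because an unnormalized transform scales with the sum of the window rather than with the sample amplitude. So samples bounded by 1 do not imply STFT magnitudes bounded by 1.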
That's interesting. I think I may have misunderstood the range of values output by librosa.stft. There are also some audio files in the LJ Speech Dataset whose mel spectrograms have values outside the range of [0, 1]. However, they tend to be only slightly outside, like in this histogram:
I'd imagine the clipping doesn't have much of an effect. But you can also try removing the clipping. It's not really necessary for the model to work, i.e. it should be able to learn to predict values outside of the [0, 1] range fine.
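As a rough check of how far outside the range such a value lands, the peak reported earlier in the thread can be pushed through the same arithmetic. The 20*log10 convention and the -100 dB minimum level are assumptions here, since neither is stated in this thread.

```python
import math

MIN_LEVEL_DB = -100.0  # assumed floor, not stated in this thread

peak = 5.7544252176933997          # np.amax(mel_spectrum) reported above
peak_db = 20.0 * math.log10(peak)  # assumed 20*log10 convention
unclipped = (peak_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB
clipped = min(max(unclipped, 0.0), 1.0)

print(f"{peak_db:.2f} dB -> unclipped {unclipped:.3f}, clipped {clipped:.3f}")
```

Under these assumptions the peak sits at roughly 15.2 dB, which normalizes to about 1.152 before clipping and to 1.0 after it. That is the kind of small overshoot that clipping flattens and that an unclipped model would have to predict directly.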
@keithito, thanks very much.
Hi,
I don't quite understand the mel spectrum normalization code:
Why is it done like this? Is max(x) supposed to be 0?