keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

how to normalize the mel spectrum? #98

Closed weixsong closed 6 years ago

weixsong commented 6 years ago

Hi,

I don't quite understand the mel spectrum normalization code:

def _normalize(S):
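  # hparams.min_level_db defaults to -100 here, so this maps dB values in
  # [min_level_db, 0] linearly onto [0, 1] and clips anything outside.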
  return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1)

Why is it done like this? Is the maximum of S supposed to be 0?

keithito commented 6 years ago

Hi @weixsong, that's a really good question!

For context, in case others are reading this, the way this is called in melspectrogram is:

def melspectrogram(y):
  D = _stft(preemphasis(y))
  S = _amp_to_db(_linear_to_mel(np.abs(D)))
  return _normalize(S)

So _normalize is called on S, which is a spectrogram with mel-scaled frequency buckets and decibel values.
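
For reference, the helpers it calls look roughly like this (paraphrased from the repo's audio.py, with the hparams values inlined as constants; the 16 kHz frame settings here are an assumption matching the setup discussed below, not necessarily the repo defaults):

import numpy as np
import librosa
from scipy import signal

# Stand-in constants; the repo reads these from hparams.
SAMPLE_RATE = 16000
N_FFT = 2048
HOP_LENGTH = 200    # 12.5 ms at 16 kHz
WIN_LENGTH = 800    # 50 ms at 16 kHz
NUM_MELS = 80
PREEMPHASIS = 0.97

_mel_basis = librosa.filters.mel(sr=SAMPLE_RATE, n_fft=N_FFT, n_mels=NUM_MELS)

def preemphasis(x):
  # First-order high-pass filter that boosts high frequencies before the STFT.
  return signal.lfilter([1, -PREEMPHASIS], [1], x)

def _stft(y):
  return librosa.stft(y=y, n_fft=N_FFT, hop_length=HOP_LENGTH, win_length=WIN_LENGTH)

def _linear_to_mel(spectrogram):
  # Project the linear-frequency magnitudes onto NUM_MELS mel bins.
  return np.dot(_mel_basis, spectrogram)

def _amp_to_db(x):
  # 20*log10 with a small floor to avoid log(0); amplitudes <= 1 give dB <= 0.
  return 20 * np.log10(np.maximum(1e-5, x))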

The thinking is that the magnitudes in the STFT should always be between 0 and 1, so converting them to decibels will always yield a negative number.

The normalization yields a nice distribution of values between 0 and 1 to predict. But I'm not sure if this is actually necessary. It's possible that the model would learn to predict the unnormalized values just fine too.
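
To make the mapping concrete, here's a minimal sketch (using min_level_db = -100, the default in hparams.py):

import numpy as np

min_level_db = -100

def _normalize(S):
  return np.clip((S - min_level_db) / -min_level_db, 0, 1)

# The -100 dB floor maps to 0, -50 dB to 0.5, and 0 dB (amplitude 1.0) to 1.
print(_normalize(np.array([-100.0, -50.0, 0.0])))  # [0.  0.5 1. ]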

weixsong commented 6 years ago

Hi @keithito, thanks for your reply. But I still have a question. I'm using the following code to verify your answer:

y, sr = librosa.load('t1_100000.wav', sr=None)

spectrum = librosa.stft(y=y, n_fft=2048, hop_length=200, win_length=800)
D = np.abs(spectrum)
mel_basis = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80)
mel_spectrum = np.dot(mel_basis, D)

>>> np.amax(mel_spectrum)
5.7544252176933997

>>> np.log10(5.7544252176933997)
0.76000195051132646

It seems the input to _amp_to_db() can be greater than 1, so the dB values are not always negative. Is there anything wrong on my side?
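
For what it's worth, pushing that maximum through the repo-style conversion (which uses 20 * log10 rather than a bare log10) shows it lands above 0 dB and the clip then flattens it to 1:

max_amp = 5.7544252176933997
db = 20 * np.log10(max_amp)                # about +15.2 dB, i.e. positive
print(np.clip((db - (-100)) / 100, 0, 1))  # 1.0 -- the overshoot is clipped away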

keithito commented 6 years ago

That's interesting. I think I may have misunderstood the range of values output by librosa.stft. There are also some audio files in the LJ Speech Dataset whose mel spectrograms have values outside the range [0, 1]. However, they tend to be only slightly outside, as shown in the histogram attached to the original comment.

I'd imagine the clipping doesn't have much of an effect, but you can also try removing it. It's not really necessary for the model to work, i.e. it should be able to learn to predict values outside the [0, 1] range just fine.
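
A minimal sketch of what that would look like (same affine mapping, just without the clip; if the network's output activation is bounded to [0, 1], it would need to change too):

def _normalize_unclipped(S):
  # Values above 0 dB may now exceed 1, and values below
  # min_level_db may go negative, instead of being clipped.
  return (S - hparams.min_level_db) / -hparams.min_level_db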

weixsong commented 6 years ago

@keithito , thanks very much.