keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License
2.94k stars 965 forks

I want to extract the mel_output but meet problem #302

Closed xieyuankun closed 4 years ago

xieyuankun commented 4 years ago

I just changed a few lines of synthesizer.py to extract the mel spectrogram and alignment:

class Synthesizer:
  def load(self, checkpoint_path, model_name='tacotron'):
    ...
    self.mel_output = self.model.mel_outputs[0]
    self.alignment = self.model.alignments[0]

  def synthesize(self, text):
    ...
    wav, mels, alignment = self.session.run(
        [self.wav_output, self.mel_output, self.alignment], feed_dict=feed_dict)
    mel_filename = os.path.join(out_dir, 'mel-1.npy')
    np.save(mel_filename, mels, allow_pickle=False)

But when I load mel-1.npy, the shape of the mel spectrogram is [5000, 80]. If I change mel_output to linear_output, the shape of the linear spectrogram is [5000, 1025]. The alignment is correct and the generated speech sounds great, so what's wrong with my code? I want to feed mel.npy into another trained vocoder.
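For anyone reproducing this: until the underlying length problem is fixed, one stopgap is to trim the trailing padding frames from the saved array before handing it to a vocoder. This is a sketch, not code from this repo, and the energy floor of 0.01 is a guessed value that depends on how the mels are normalized:

```python
import numpy as np

def trim_padded_mel(mels, floor=0.01):
    """Drop trailing frames whose mean energy sits at the padding floor.

    `mels` has shape [T, n_mels]. The 0.01 floor is a hypothetical
    threshold; adjust it to match your spectrogram normalization.
    """
    energy = mels.mean(axis=1)
    keep = np.flatnonzero(energy > floor)
    return mels[:keep[-1] + 1] if keep.size else mels
```

Called as `trim_padded_mel(np.load('mel-1.npy'))`. Note this only masks the symptom: the decoder still wastes time generating all 5000 frames.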

In eval.py, I only have one sentence to generate:

tests = ['After breakfast he came and fitted me with a bridle .']

Here is the spectrogram: [spectrogram image]

xieyuankun commented 4 years ago

I have solved it. The spectrogram has length 5000 because the model has no stop token loss, so every spectrogram is decoded out to the maximum length.

sujeendran commented 4 years ago

@xieyuankun Hi, sorry to get back on this after so long, but can you explain how you solved the issue with the mel spectrogram shape? How can the stop token be used to limit the shape?

xieyuankun commented 4 years ago

> @xieyuankun Hi, sorry to get back on this after so long, but can you explain how you solved the issue with the mel spectrogram shape? How can the stop token be used to limit the shape?

In his version of Tacotron 1 there is no stop token loss to limit the length of the mels and linears, so the output shape [N, T_out, M] is not fixed: T_out may be 5000 or the correct value depending on the input text. At synthesis time, if the text comes from the training data, the dynamic decoder tends to stop at about the right length because the network has memorized it. With arbitrary text, however, the decoder often runs all the way to 5000 frames (the maximum decode length), and audio.find_endpoint only trims the trailing silence from the final waveform, not from the spectrogram.

Tacotron 2 adds a stop token loss that lets the model predict where T_out should end, so you can extract correctly sized mels and use a vocoder to generate high quality audio from them. If you want to extract a correct mel spectrogram, either add a stop token loss to Tacotron 1 or use Tacotron 2.
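To make the mechanism concrete, here is a minimal sketch in the style of Tacotron 2 inference (not code from this repo): the decoder also emits a per-frame stop probability, and generation is cut at the first frame whose probability crosses a threshold (0.5 is the usual choice). If no frame ever crosses it, you get the full-length failure mode described above:

```python
import numpy as np

def truncate_at_stop_token(mels, stop_probs, threshold=0.5):
    """Cut `mels` ([T, n_mels]) at the first frame whose predicted stop
    probability reaches `threshold`. If no frame reaches it, the whole
    sequence is kept, which is exactly the max-length failure mode.
    """
    stop_frames = np.flatnonzero(stop_probs >= threshold)
    end = stop_frames[0] + 1 if stop_frames.size else len(mels)
    return mels[:end]
```

During training the stop probabilities are supervised with a binary cross-entropy loss against a target that is 0 for real frames and 1 for the final frame and padding, which is the "stop token loss" referred to above.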