jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Mel spectrograms from TensorflowTTS/tacotron2 #61

Open riverphoenix opened 3 years ago

riverphoenix commented 3 years ago

I trained a model using hifi-gan on VCTK, and when I test it on the wav files the results look good.

However, when I run the model on mel-spectrograms generated by TensorflowTTS/tacotron2, all I get is noise. I am fairly sure this comes down to preprocessing (I am using the tacotron2 LJSpeech pipeline), so I am asking whether anyone knows what transformation I should apply to the generated mel-spectrogram before passing it to the hifi-gan model.

Below are the specs used for each:

Pipeline for generation:

    processor = AutoProcessor.from_pretrained("models/tacotron2/ljspeech_mapper.json")
    config = AutoConfig.from_pretrained("models/tacotron2/tacotron2.v1.yaml")
    tacotron2 = TFAutoModel.from_pretrained(config=config, pretrained_path=None, is_build=True, name="tacotron2")

    input_text = "And it is worth mention in passing that, as an example of fine typography,"
    input_ids = processor.text_to_sequence(input_text)

    tacotron2.setup_window(win_front=6, win_back=6)
    tacotron2.setup_maximum_iterations(3000)
    tacotron2.load_weights("models/tacotron2/exp/checkpoints/model.h5")

    decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32))

    mel_outputs = np.transpose(mel_outputs.numpy(), (0, 2, 1))
    mel_outputs = torch.FloatTensor(mel_outputs).to(device)

    config_file = 'models/hifi-gan/config.json'
    with open(config_file) as f:
        data = f.read()
    json_config = json.loads(data)
    h = AttrDict(json_config)
    torch.manual_seed(h.seed)
    torch.cuda.manual_seed(h.seed)
    device = torch.device('cpu')

    inference(device,h,mel_outputs)

jik876 commented 3 years ago

Thanks for your interest, and sorry for the late reply. Each two-stage TTS model uses slightly different pre-processing, so modifications are needed to make them compatible. The implementation we provide is compatible with NVIDIA Tacotron2 and Glow-TTS. I have seen in its readme that the implementation you mentioned already supports HiFi-GAN. For compatibility, it is advisable to use an implementation that has already been tested.
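To make the kind of modification mentioned above concrete, here is a minimal sketch and not a tested recipe. It relies on two things. First, this repository's meldataset.py compresses mels as the natural log of the magnitude clamped at 1e-5. Second, and this is an assumption to verify against the TensorflowTTS preprocessing config, the Tacotron2 output is taken to be a log10 mel that was standardized per mel bin with a saved mean and scale. The function names, the `mean` and `scale` arrays, and the way the statistics are loaded are all hypothetical. The key list uses the names found in this repository's config.json, so the other pipeline's settings would have to be renamed to match before comparing.

```python
import numpy as np

# Mel settings that must agree between the acoustic model and the vocoder.
# Key names follow HiFi-GAN's config.json; the other pipeline may name them differently.
MEL_KEYS = ("sampling_rate", "n_fft", "hop_size", "win_size", "num_mels", "fmin", "fmax")


def mismatched_params(vocoder_cfg, acoustic_cfg, keys=MEL_KEYS):
    """Return {key: (vocoder_value, acoustic_value)} for every setting that differs."""
    return {
        k: (vocoder_cfg.get(k), acoustic_cfg.get(k))
        for k in keys
        if vocoder_cfg.get(k) != acoustic_cfg.get(k)
    }


def to_hifigan_mel(mel_norm, mean, scale, clip_val=1e-5):
    """Convert a standardized log10 mel of shape (frames, n_mels) into a
    natural-log mel of shape (1, n_mels, frames).

    ASSUMPTION: mel_norm = (log10_mel - mean) / scale, with per-bin statistics.
    """
    mel_log10 = mel_norm * scale + mean            # undo the assumed standardization
    mel_ln = mel_log10 * np.log(10.0)              # log10(x) * ln(10) == ln(x)
    mel_ln = np.maximum(mel_ln, np.log(clip_val))  # same floor as log(clamp(x, min=clip_val))
    return mel_ln.T[np.newaxis, ...].astype(np.float32)
```

If `mismatched_params` reports differences in values such as `fmin`, `fmax`, or `hop_size`, no rescaling of the log values will fix the output, because the two mels are then built from different filterbanks or frame rates. In that case the vocoder would need to be retrained or fine-tuned on mels produced by the acoustic model's own pipeline.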