Mel spectrograms from TensorFlowTTS/tacotron2

I trained a model using hifi-gan on VCTK, and when I test it on the wav files the results look good.

However, when I run the model on mel-spectrograms generated by TensorFlowTTS/tacotron2, all I get is noise. I am sure it has to do with preprocessing (the tacotron2 model uses the LJSpeech pipeline), so I am asking whether anyone knows, or can help with, what transformation I should apply to the mel-spectrogram it produces before passing it to the hifi-gan model.

Below are the specs used for each:

hifi-gan: config_v3.json (https://github.com/jik876/hifi-gan/blob/master/config_v3.json)
tacotron2: ljspeech_preprocess.yaml (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/preprocess/ljspeech_preprocess.yaml)
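
One way to narrow this down is to compare the feature-extraction settings of the two files side by side. The snippet below is only a sketch: `feature_mismatches` and `KEY_MAP` are hypothetical names, and the key names in `KEY_MAP` are my assumption about how the two configs label the same settings, so they should be checked against the actual files.

```python
# Assumed correspondence between TensorFlowTTS keys (left) and
# hifi-gan keys (right). Verify these against both config files.
KEY_MAP = {
    "sampling_rate": "sampling_rate",
    "fft_size": "n_fft",
    "hop_size": "hop_size",
    "num_mels": "num_mels",
    "fmin": "fmin",
    "fmax": "fmax",
}


def feature_mismatches(tts_cfg, hifigan_cfg, key_map=KEY_MAP):
    """Return {tts_key: (tts_value, hifigan_value)} for settings that differ."""
    diffs = {}
    for tts_key, hifigan_key in key_map.items():
        tts_value = tts_cfg.get(tts_key)
        hifigan_value = hifigan_cfg.get(hifigan_key)
        if tts_value != hifigan_value:
            diffs[tts_key] = (tts_value, hifigan_value)
    return diffs
```

The two dictionaries can be loaded with `yaml.safe_load` and `json.load` respectively. Any setting reported here (for example a different frequency range) changes the mel filterbank itself, so it probably cannot be fixed by rescaling the spectrogram afterwards.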

Pipeline for generation:

    import json

    import numpy as np
    import tensorflow as tf
    import torch
    from tensorflow_tts.inference import AutoConfig, AutoProcessor, TFAutoModel
    from env import AttrDict  # AttrDict lives in env.py of the hifi-gan repo

    # Load the tacotron2 processor, config, and model.
    processor = AutoProcessor.from_pretrained("models/tacotron2/ljspeech_mapper.json")
    config = AutoConfig.from_pretrained("models/tacotron2/tacotron2.v1.yaml")
    tacotron2 = TFAutoModel.from_pretrained(config=config, pretrained_path=None, is_build=True, name="tacotron2")

    input_text = "And it is worth mention in passing that, as an example of fine typography,"
    input_ids = processor.text_to_sequence(input_text)

    tacotron2.setup_window(win_front=6, win_back=6)
    tacotron2.setup_maximum_iterations(3000)
    tacotron2.load_weights("models/tacotron2/exp/checkpoints/model.h5")

    # Generate the mel-spectrogram: (batch, frames, n_mels).
    decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32),
    )

    # Load the hifi-gan config and pick the device before moving tensors to it.
    config_file = 'models/hifi-gan/config.json'
    with open(config_file) as f:
        data = f.read()
    json_config = json.loads(data)
    h = AttrDict(json_config)
    torch.manual_seed(h.seed)
    torch.cuda.manual_seed(h.seed)
    device = torch.device('cpu')

    # hifi-gan expects (batch, n_mels, frames), so swap the last two axes.
    mel_outputs = np.transpose(mel_outputs.numpy(), (0, 2, 1))
    mel_outputs = torch.FloatTensor(mel_outputs).to(device)

    # Run the hifi-gan generator on the mel-spectrogram (inference() is not shown in this post).
    inference(device, h, mel_outputs)
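
A possible starting point for the transformation, as a sketch only. It assumes that the tacotron2 output is a log10 mel-spectrogram standardized per mel bin as `(x - mean) / scale`, and that the hifi-gan model was trained on natural-log mel-spectrograms. Both points are assumptions to verify against the two preprocessing scripts. `tts_mel_to_hifigan` is a hypothetical helper, and `mean` and `scale` stand for the per-bin statistics saved during TensorFlowTTS preprocessing, which the caller has to load.

```python
import numpy as np

LN10 = np.log(10.0)


def tts_mel_to_hifigan(mel_norm, mean, scale):
    """Map standardized log10 mels (batch, frames, n_mels) to
    natural-log mels (batch, n_mels, frames).

    Assumes mel_norm = (log10_mel - mean) / scale per mel bin.
    """
    mel_norm = np.asarray(mel_norm, dtype=np.float32)
    mel_log10 = mel_norm * scale + mean   # undo the assumed standardization
    mel_ln = mel_log10 * LN10             # log10(x) * ln(10) == ln(x)
    return np.transpose(mel_ln, (0, 2, 1))
```

The result can then be wrapped with `torch.FloatTensor(...)` in place of the plain transpose in the pipeline. This only aligns the scale of the features. If the two configs also differ in frequency range or FFT settings, or if the vocoder has never seen LJSpeech-style audio, the output may still be noisy, and fine-tuning hifi-gan on features from the tacotron2 pipeline may be needed.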
