Cannot infer spectrogram yield from Nvidia NeMo

Sueoka-ppc commented 2 years ago

I am trying to run vocoder inference on a spectrogram generated by a NeMo glow_tts model (the vocoder is one of the pre-trained models jsut_multi_band_melgan.v2 or jsut_hifigan.v1).

But the following error occurs: RuntimeError: The size of tensor a (226) must match the size of tensor b (80) at non-singleton dimension 2

Can anyone help me?

My code:

import soundfile as sf
import datetime
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder
from nemo.collections.tts.models import MelGanModel
from nemo.collections.tts.models import WaveGlowModel

timeS = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

# Load the trained Glow-TTS model from a local .nemo checkpoint
spec_generator = SpectrogramGenerator.restore_from(restore_path="./examples/tts/glow_tts_train_ptm01/glow_tts_train01/checkpoints/glow_tts_train01.nemo").cuda()
# Download and load the pretrained hifigan model
#vocoder = Vocoder.from_pretrained(model_name="tts_hifigan").cuda()

#vocoder = MelGanModel.from_pretrained(model_name="tts_melgan").cuda()
#vocoder = WaveGlowModel.from_pretrained(model_name="tts_waveglow_268m").cuda()

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("koNnnichiwA.watashiwaeiaichandesu.")
# They then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
#audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

import os
import time

import numpy as np
import soundfile as sf
import torch
import yaml

from tqdm import tqdm

from parallel_wavegan.datasets import MelDataset
from parallel_wavegan.datasets import MelSCPDataset
from parallel_wavegan.utils import load_model
from parallel_wavegan.utils import read_hdf5

#checkpoint="./jsut-mbgan/checkpoint-1000000steps.pkl"
checkpoint ="./train_nodev_jsut_hifigan.v1/checkpoint-2500000steps.pkl"

dirname = os.path.dirname(checkpoint)
config = os.path.join(dirname, "config.yml")
with open(config) as f:
    config = yaml.load(f, Loader=yaml.Loader)

# setup model
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
model = load_model(checkpoint, config)
print(f"Loaded model parameters from {checkpoint}")

model.remove_weight_norm()
model = model.eval().to(device)

# start generation
total_rtf = 0.0
with torch.no_grad(): # tqdm(dataset, desc="[decode]") as pbar:
    #for idx, (utt_id, c) in enumerate(pbar, 1):
        # generate
    c = torch.tensor(spectrogram, dtype=torch.float).to(device)
    start = time.time()
    y = model.inference(c,normalize_before=True).view(-1)
    #rtf = (time.time() - start) / (len(y) / config["sampling_rate"])
    #pbar.set_postfix({"RTF": rtf})
    #total_rtf += rtf

sf.write("speech"+timeS+".wav", y.cpu().numpy(), 22050)

kan-bayashi commented 2 years ago

Please check the shape of the input. It might be transposed. https://github.com/kan-bayashi/ParallelWaveGAN/blob/e027f53ee7c5dc813d61cf3a47749a6e2abc9369/parallel_wavegan/models/hifigan.py#L251-L260

Sueoka-ppc commented 2 years ago

Sorry, I am a beginner with tensors and I don't know how to fix this.

Should I change the shape, or is this spectrogram not suitable for this model?

kan-bayashi commented 2 years ago

Please check the shape of inputs. Maybe your input shape is (#mels, #frames) or (#batch, #frames, #mels) but my implementation assumes (#frames, #mels). Please modify the shape by yourself.
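For reference, a minimal sketch of the shape fix described above. The helper name to_frames_by_mels and its num_mels parameter are hypothetical (not part of NeMo or ParallelWaveGAN), and it assumes the generator output has a leading batch axis of size 1, as in the (1, 80, 226) tensor seen later in this thread.

```python
import torch


def to_frames_by_mels(spec: torch.Tensor, num_mels: int = 80) -> torch.Tensor:
    """Return `spec` as a 2D (#frames, #mels) tensor.

    Hypothetical helper: accepts (#frames, #mels), (#mels, #frames),
    or either of those with a leading batch axis of size 1.
    """
    if spec.dim() == 3:
        if spec.size(0) != 1:
            raise ValueError("expected a single utterance (batch size 1)")
        # Drop the batch axis, e.g. (1, 80, 226) -> (80, 226).
        spec = spec.squeeze(0)
    if spec.dim() != 2:
        raise ValueError(f"expected a 2D or 3D tensor, got {spec.dim()}D")
    if spec.size(1) == num_mels:
        # Already (#frames, #mels). Also taken when both axes equal num_mels,
        # in which case the layout cannot be detected from the shape alone.
        return spec
    if spec.size(0) == num_mels:
        # (#mels, #frames) -> (#frames, #mels)
        return spec.transpose(0, 1)
    raise ValueError(f"no axis of size {num_mels} in shape {tuple(spec.shape)}")


# Dummy tensor with the shape reported in this thread.
dummy = torch.zeros(1, 80, 226)
print(tuple(to_frames_by_mels(dummy).shape))  # (226, 80)
```

In the script above, the result of this conversion would be passed to model.inference in place of the raw spectrogram. Adding extra axes by hand, as in a 4D (1, 80, 226, 1) tensor, should not be needed when the input is already 2D.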

Sueoka-ppc commented 2 years ago

I tried to reshape it, but another runtime error occurs:

RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 80, 7], but got 4-dimensional input of size [1, 80, 226, 1] instead

Can anyone help me?

kan-bayashi commented 2 years ago

You are using the wrong shape. Please check the docstring carefully. The input is 2D (#frames, #mels): https://github.com/kan-bayashi/ParallelWaveGAN/blob/e027f53ee7c5dc813d61cf3a47749a6e2abc9369/parallel_wavegan/models/hifigan.py#L251-L260

Sueoka-ppc commented 2 years ago

The reshape now succeeds, but the generated wav is broken. I think there is a mismatch between the spectrogram generator and the vocoder.

Thanks for your support.
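One way to check for such a mismatch is to compare the feature-extraction settings of both models side by side. The sketch below is only an illustration: find_mismatches is a hypothetical helper, the key names follow ParallelWaveGAN's config.yml (sampling_rate, hop_size, num_mels) plus an assumed log_base entry, the NeMo-side values would have to be copied by hand from the training config, and every number shown is a placeholder rather than a value read from the models in this thread.

```python
def find_mismatches(generator_cfg: dict, vocoder_cfg: dict) -> dict:
    """Return {key: (generator_value, vocoder_value)} for each differing setting.

    A key that is missing on one side is reported with None for that side.
    """
    keys = sorted(set(generator_cfg) | set(vocoder_cfg))
    return {
        key: (generator_cfg.get(key), vocoder_cfg.get(key))
        for key in keys
        if generator_cfg.get(key) != vocoder_cfg.get(key)
    }


# Placeholder values for illustration only; replace them with the settings
# from the spectrogram generator's training config and the vocoder's config.yml.
generator_cfg = {"sampling_rate": 22050, "hop_size": 256, "num_mels": 80, "log_base": "e"}
vocoder_cfg = {"sampling_rate": 24000, "hop_size": 300, "num_mels": 80, "log_base": "10"}

for key, (gen_value, voc_value) in find_mismatches(generator_cfg, vocoder_cfg).items():
    print(f"{key}: generator={gen_value}, vocoder={voc_value}")
```

If any of these settings differ, the vocoder receives features it was not trained on, which could explain broken audio even when the tensor shape is correct. The normalization statistics applied by normalize_before=True are also specific to the vocoder's own training features.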

kan-bayashi / ParallelWaveGAN

Cannot infer spectrogram yield from Nvidia NeMo #344