lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

Question in the inference #51

Open ggbangyes opened 1 year ago

ggbangyes commented 1 year ago

The required spectrogram format is [N, C, W]:

spectrogram = # get your hands on a spectrogram in [N,C,W] format

Could you please explain these three dimensions?

I use the code from this repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning to produce the mel spectrogram and DiffWave as the vocoder, but the audio I get is pure noise.

Generate the mel spectrogram

specs = synthesizer.synthesize_spectrograms(texts, embeds)  # len(specs) == 1
spec = specs[0]  # numpy.ndarray, float32, shape (80, 314)
spec = torch.tensor(spec)

Generate the waveform

diffwave_dir = "/hdd/haoran_project/diffwave-master/pretrained_models/diffwave-ljspeech-22kHz-1000578.pt"
generated_wav, sample_rate = diffwave_predict(spec, diffwave_dir, fast_sampling=True)

Save it to disk

filename = "results/diffwave_Elon.wav"
print(generated_wav.dtype, " ", generated_wav.shape)  # torch.float32 torch.Size([1, 87040])
torchaudio.save(filename, generated_wav.cpu(), sample_rate=sample_rate)

sharvil commented 1 year ago

N = batch size
C = number of channels (the number of mel bands)
W = width, i.e. the time dimension

I also recommend reading through https://github.com/lmnt-com/diffwave/blob/master/src/diffwave/preprocess.py for more details on the spectrogram.
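For a single spectrogram like the one in the question (shape (80, 314)), reaching [N, C, W] is just a matter of adding a batch dimension. A minimal sketch; the array here is random placeholder data standing in for a real mel spectrogram:

```python
import numpy as np
import torch

# Placeholder mel spectrogram: C = 80 mel bands, W = 314 frames.
spec = np.random.rand(80, 314).astype(np.float32)

# Prepend a batch dimension (N = 1) to get the [N, C, W] layout.
spec = torch.from_numpy(spec).unsqueeze(0)
print(spec.shape)  # torch.Size([1, 80, 314])
```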

ggbangyes commented 1 year ago

Thank you, Sharvil, your answer was very helpful. I can now reconstruct high-quality audio through the audio-to-mel-to-audio round trip.

It seems the preprocess.py file applies some normalization to the mel spectrogram data, like:

spectrogram = 20 * torch.log10(torch.clamp(spectrogram, min=1e-5)) - 20
spectrogram = torch.clamp((spectrogram + 100) / 100, 0.0, 1.0)
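As I understand it, the first line converts linear magnitudes to dB (clamping at 1e-5 to avoid log(0), with a -20 dB reference shift), and the second maps dB values in [-100, 0] linearly onto [0, 1], clamping anything below that range to 0. A small sketch with assumed example magnitudes:

```python
import torch

# Assumed example magnitudes, from near-silence up to full scale.
mag = torch.tensor([1e-7, 1e-5, 1e-2, 0.1, 1.0])

# dB conversion: the clamp avoids log(0); -20 shifts the reference level.
db = 20 * torch.log10(torch.clamp(mag, min=1e-5)) - 20

# Linear map of [-100, 0] dB onto [0, 1], clamped at both ends.
norm = torch.clamp((db + 100) / 100, 0.0, 1.0)
print(norm)  # tensor([0.0000, 0.0000, 0.4000, 0.6000, 0.8000])
```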

But for a different kind of mel spectrogram, how should these normalization hyperparameters be chosen?