Open ggbangyes opened 1 year ago
- `N` = batch size
- `C` = number of channels (number of mel dimensions)
- `W` = width or time dimension
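The three dimensions above can be sketched with a dummy tensor (the 80×314 shape is just an illustration, borrowed from the spectrogram mentioned later in this thread):

```python
import torch

# A 2-D mel spectrogram as most TTS front ends produce it: (C, W),
# i.e. 80 mel bins by some number of time frames.
mel = torch.randn(80, 314)

# DiffWave expects [N, C, W]; add a batch dimension of size 1 in front.
batch = mel.unsqueeze(0)
print(batch.shape)  # torch.Size([1, 80, 314])
```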
I also recommend reading through https://github.com/lmnt-com/diffwave/blob/master/src/diffwave/preprocess.py for more details on the spectrogram.
Thank you Sharvil, your answer is very helpful, and with it I can get high-quality audio → mel → audio reconstruction.
It seems that the preprocess.py file applies some normalization to the mel spectrogram data:

```python
spectrogram = 20 * torch.log10(torch.clamp(spectrogram, min=1e-5)) - 20
spectrogram = torch.clamp((spectrogram + 100) / 100, 0.0, 1.0)
```

But for a different kind of mel spectrogram, how do I choose these normalization hyper-parameters?
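For what it's worth, here is my reading of what those two lines do numerically. The parameter names `floor`, `ref_db`, and `min_db` are my own labels for the literals `1e-5`, `20`, and `100` in preprocess.py; together they define a dB window that gets rescaled to [0, 1], so for a different mel extractor you would pick them so that the bulk of your magnitudes lands inside that window:

```python
import math

def normalize(x, floor=1e-5, ref_db=20.0, min_db=-100.0):
    """Reproduce the preprocess.py normalization for a single magnitude.

    Magnitudes are converted to dB relative to ref_db, the floor keeps
    log10 finite, and everything below min_db is clipped before the
    [min_db, 0] window is rescaled to [0, 1].
    """
    db = 20.0 * math.log10(max(x, floor)) - ref_db
    return min(max((db - min_db) / -min_db, 0.0), 1.0)

print(normalize(10.0))  # 1.0  (at or above the reference -> saturates)
print(normalize(1.0))   # 0.8  (-20 dB relative to the reference)
print(normalize(0.0))   # 0.0  (clamped to the floor, below min_db)
```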
The required spectrogram format is [N,C,W]:

```python
spectrogram = # get your hands on a spectrogram in [N,C,W] format
```

Could you please explain these three dimensions?
I use the code from this repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning to produce the mel spectrogram and use diffwave as the vocoder, but I only get audio full of noise.
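Noise like this often comes from the two repos disagreeing on mel hyper-parameters. A quick sanity check is to compare them side by side; the values below are my reading of each repo's default config at the time of writing and should be verified against your own checkouts:

```python
# Hypothetical sanity check: compare the mel front ends of the two repos.
# Values are assumptions taken from their default configs -- verify them.
rtvc_mel = {"sample_rate": 16000, "n_fft": 800, "hop_length": 200, "n_mels": 80}
diffwave_mel = {"sample_rate": 22050, "n_fft": 1024, "hop_length": 256, "n_mels": 80}

# Any key in this list means the DiffWave checkpoint was trained on
# spectrograms the RTVC synthesizer does not produce.
mismatched = sorted(k for k in rtvc_mel if rtvc_mel[k] != diffwave_mel[k])
print(mismatched)
```

If the sets differ, the checkpoint cannot be expected to vocode the RTVC spectrograms directly; either retrain DiffWave on RTVC-style mels or regenerate the mels with DiffWave's preprocessing.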
Generate the mel spectrogram:

```python
specs = synthesizer.synthesize_spectrograms(texts, embeds)  # len(specs) == 1
spec = specs[0]  # numpy array, float32, shape (80, 314)
spec = torch.tensor(spec)
```
Generate the waveform:

```python
diffwave_dir = "/hdd/haoran_project/diffwave-master/pretrained_models/diffwave-ljspeech-22kHz-1000578.pt"
generated_wav, sample_rate = diffwave_predict(spec, diffwave_dir, fast_sampling=True)
```
Save it to disk:

```python
filename = "results/diffwave_Elon.wav"
print(generated_wav.dtype, " ", generated_wav.shape)  # torch.float32 torch.Size([1, 87040])
torchaudio.save(filename, generated_wav.cpu(), sample_rate=sample_rate)
```