lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

Inference #15

Closed Pranjalya closed 3 years ago

Pranjalya commented 3 years ago

When I run inference with the provided LJSpeech pretrained model on one of the reference audio files, the output is a very low-amplitude sound (almost silence). When I used a model trained on a custom dataset instead, the result was static noise. What could be going wrong?

yeswecan commented 3 years ago

Interested in this too. I trained it on a custom dataset and also tried the pretrained model, with the same result. @Pranjalya were you able to make any progress on this?

Pranjalya commented 3 years ago

@yeswecan Sadly, not yet.

sharvil commented 3 years ago

Can you provide repro steps for the pretrained model? I've verified that it works fine over here with the following recipe:

git clone https://github.com/lmnt-com/diffwave
cd diffwave
wget https://lmnt.com/assets/diffwave/diffwave-ljspeech-22kHz-1000578.pt
# copy or link LJSpeech's wavs/ directory into the current directory
python src/diffwave/preprocess.py wavs/
python src/diffwave/inference.py diffwave-ljspeech-22kHz-1000578.pt wavs/LJ001-0001.wav.spec.npy -o output.wav

YueZhou-oh commented 3 years ago

I got the same problem as @Pranjalya. I used LJ001-0001.wav from LJSpeech-1.1 to generate the mel-spectrogram LJ001-0001.wav.spec.npy with

python src/diffwave/preprocess.py wavs/

and then generated output.wav with

wget https://lmnt.com/assets/diffwave/diffwave-ljspeech-22kHz-1000578.pt
python src/diffwave/inference.py diffwave-ljspeech-22kHz-1000578.pt wavs/LJ001-0001.wav.spec.npy -o output.wav -f

PS. I am using the CPU version of torch. Here is the output file: output.zip

sharvil commented 3 years ago

Thanks for the attachment; that was very helpful.

Okay, looks like torchaudio made some breaking changes and that's why folks are running into this problem. I'm going to pin the requirement to torchaudio==0.7.0 instead of torchaudio>=0.6.0.

The issue is that torchaudio used to load 16-bit samples to floating point in [-32768.0, 32767.0]. Newer versions of torchaudio rescale the samples to [-1, 1] (and they got rid of torchaudio.load_wav entirely in 0.9.0).
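
To make the scaling mismatch concrete, here is a small sketch of the arithmetic in plain Python rather than torch. It assumes the preprocessing step divides each sample by 32767.5 and clamps to [-1, 1]; the helper name `preprocess_scale` and the sample values are hypothetical and chosen only for illustration.

```python
INT16_SCALE = 32767.5  # assumed divisor applied during preprocessing


def preprocess_scale(sample: float) -> float:
    """Divide by the int16 scale and clamp to [-1, 1] (hypothetical helper)."""
    return max(-1.0, min(1.0, sample / INT16_SCALE))


# Older torchaudio: a fairly loud 16-bit sample arrives as roughly 16384.0.
old_style = preprocess_scale(16384.0)

# Newer torchaudio: the same sample arrives already rescaled to about 0.5.
new_style = preprocess_scale(0.5)

print(f"old-style input -> {old_style:.6f}")
print(f"new-style input -> {new_style:.6f}")
```

With the newer loader the sample is effectively scaled twice, so the value fed to the spectrogram transform is about 32768 times smaller than intended, which would be consistent with the near-silent output reported above.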

YueZhou-oh commented 3 years ago

Alternatively, with a newer torchaudio, revising the transform function also works:

# line 33 in preprocess.py
# audio = torch.clamp(audio[0] / 32767.5, -1.0, 1.0)
audio = torch.clamp(audio[0], -1.0, 1.0)

jayachandrakalakutagar commented 2 years ago

I am also getting static noise from the output of the DiffWave vocoder. Can you please help me here?

YueZhou-oh commented 2 years ago

Switching torchaudio to version 0.7.0, or revising line 33 in preprocess.py, also works.
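
If neither option is convenient, one possible alternative is to choose the divisor from the range of the loaded samples. The following is only a sketch in plain Python: `to_unit_range` is a hypothetical helper, not part of DiffWave, and the peak-based heuristic is an assumption that would misclassify a 16-bit file whose samples never exceed 1 in magnitude.

```python
from typing import List, Sequence

INT16_SCALE = 32767.5  # divisor from the original line 33, per this thread


def to_unit_range(samples: Sequence[float]) -> List[float]:
    """Clamp samples to [-1, 1], rescaling only if they look like int16 values."""
    peak = max((abs(s) for s in samples), default=0.0)
    # Heuristic (assumption): a peak above 1.0 means the loader did not rescale.
    divisor = INT16_SCALE if peak > 1.0 else 1.0
    return [max(-1.0, min(1.0, s / divisor)) for s in samples]


print(to_unit_range([-32768.0, 0.0, 16384.0]))  # int16-style input is rescaled
print(to_unit_range([-1.0, 0.0, 0.5]))          # already in [-1, 1], left unchanged
```

In preprocess.py the same idea would apply to the tensor returned by the loader, before the mel-spectrogram transform is computed.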

jayachandrakalakutagar commented 2 years ago

@YueZhou-oh Actually, my torchaudio version is 0.9.0, and the mel-spectrogram is produced by a different model. I want to generate the output with this vocoder, but the resulting audio is simply noise.

YueZhou-oh commented 2 years ago

Seems like an amplitude issue. Maybe you can compare the amplitude range of the mel-spectrogram generated by your model with the one produced by src/diffwave/preprocess.py.
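
One way to run that comparison is to summarize both spectrograms and look at the numbers side by side. The sketch below works on flattened lists of values in plain Python; `summarize` and `ranges_match` are hypothetical helpers, the 0.1 tolerance is an arbitrary assumption, and the sample values are made up. With NumPy installed, the values could come from something like `np.load(path).ravel().tolist()`.

```python
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return min, max, and mean of a flattened spectrogram."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def ranges_match(a: Dict[str, float], b: Dict[str, float], tol: float = 0.1) -> bool:
    """True when the min and max of both summaries agree within tol."""
    return abs(a["min"] - b["min"]) <= tol and abs(a["max"] - b["max"]) <= tol


# Made-up values for illustration only, not real spectrogram data.
from_preprocess = summarize([0.0, 0.35, 0.6, 0.9])
from_other_model = summarize([-11.5, -4.0, -1.2, 0.3])

print(from_preprocess)
print(from_other_model)
print("compatible ranges:", ranges_match(from_preprocess, from_other_model))
```

If the two summaries differ widely, the other model's spectrogram likely needs to be rescaled to the convention used by preprocess.py before it is passed to the vocoder.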