jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

griffin-lim gives strange output #29

Closed OnceJune closed 3 years ago

OnceJune commented 3 years ago

Hi, I tried the code with a Chinese corpus, using this config:

    "sampling_rate": 16000,
    "filter_length": 1024,
    "hop_length": 200,
    "win_length": 800,
    "n_mel_channels": 80,
    "mel_fmin": 96.0,
    "mel_fmax": 7600.0,

The corpus is about 20 hours long, and I used the checkpoint from the 160th epoch to generate my mel spectrogram. I tried Griffin-Lim by modifying inference.ipynb:

    (y_gen_tst, *r), attn_gen, *_ = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)
    mel_np = y_gen_tst.cpu().squeeze(0).numpy()
    res = librosa.feature.inverse.mel_to_audio(mel_np, sr=16000, n_fft=1024, hop_length=200, win_length=800)

and finally:

    librosa.output.write_wav('sample_output.wav', res, 16000)

The output is a long silence:

[Image: long-silence]

My question: should I wait for more epochs, or am I using Griffin-Lim the wrong way?

BTW, the generated mel values look like: [-10.xxxx, -11.xxxx, ...]

OnceJune commented 3 years ago

Sorry, I forgot to apply mel_ = np.power(10.0, mel_) before running Griffin-Lim. Thanks for your impressive code :)
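
For anyone hitting the same silence, here is a minimal sketch of that fix. It is not the repository's code. The helper names `undo_log_compression` and `griffin_lim_from_log_mel` are made up for illustration, and `power=1.0`, `fmin` and `fmax` are assumptions that the model's mels are magnitude spectrograms built with the `mel_fmin`/`mel_fmax` values from the config above. The only step taken from this thread is `np.power(10.0, mel_)`. The sketch writes the file with `soundfile` because `librosa.output.write_wav` was removed in librosa 0.8.

```python
import numpy as np


def undo_log_compression(log_mel, base=10.0):
    """Map a log-compressed mel spectrogram back to the linear domain.

    `base` is an assumption about how the training mels were compressed:
    10.0 matches the fix in this thread, np.e would match a natural log.
    """
    log_mel = np.asarray(log_mel, dtype=np.float32)
    return np.power(base, log_mel)


def griffin_lim_from_log_mel(log_mel, out_path="sample_output.wav", base=10.0):
    """Hypothetical helper: invert a log-mel of shape (n_mels, frames) to audio."""
    # Imported lazily so the conversion helper above works without these packages.
    import librosa
    import soundfile as sf

    mel = undo_log_compression(log_mel, base=base)
    audio = librosa.feature.inverse.mel_to_audio(
        mel,
        sr=16000,
        n_fft=1024,
        hop_length=200,
        win_length=800,
        power=1.0,    # assumption: the mel is a magnitude (not power) spectrogram
        fmin=96.0,    # assumption: same mel filter bank limits as the training config
        fmax=7600.0,
    )
    sf.write(out_path, audio, 16000)
    return audio
```

With the variables from the question, the call would look like `griffin_lim_from_log_mel(y_gen_tst.cpu().squeeze(0).numpy())`. If the mels were compressed with the natural logarithm rather than log10 (worth checking in the preprocessing code), pass `base=np.e` instead.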