r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

what is the option of the pre-trained model #46

Closed gudwns1215 closed 6 years ago

gudwns1215 commented 6 years ago

hi r9y9, thank you for sharing this wonderful program. I downloaded your pre-trained model and tried to synthesize by running:

python synthesis.py checkpoint/lj_check.pth generated/test_awb --conditional=./LJSpeech-1.1/data/ljspeech-mel-00001.npy

but the result is not good (lj_check.wav.zip). How do I get the same voice as the samples at https://r9y9.github.io/wavenet_vocoder/ ?

r9y9 commented 6 years ago

I changed the audio feature extraction pipeline a bit after I trained the model used to generate the samples at https://r9y9.github.io/wavenet_vocoder/, so you will need to adjust for that. Please check out https://github.com/r9y9/wavenet_vocoder/commit/489e6fa92eda9ecf5b953b2783d5975d2fdee27a and then start over from extracting the mel-spectrograms.
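For reference, re-extracting the conditioning mel-spectrograms with parameters matching the hparams posted in this thread (fft_size=1024, hop_size=256, num_mels=80, fmin=125, fmax=7600, ref_level_db=20, min_level_db=-100) looks roughly like the sketch below. This is not the repo's actual audio.py; it is a NumPy-only approximation of a Tacotron-style pipeline, and the function names and exact normalization are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def melspectrogram(wav, sr=22050, fft_size=1024, hop_size=256,
                   num_mels=80, fmin=125, fmax=7600,
                   ref_level_db=20, min_level_db=-100):
    # Frame the signal, window it, and take the STFT magnitude
    n_frames = 1 + (len(wav) - fft_size) // hop_size
    window = np.hanning(fft_size)
    frames = np.stack([wav[i * hop_size:i * hop_size + fft_size] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=fft_size, axis=1)).T
    # Apply the mel filterbank, then convert amplitude to dB
    # referenced to ref_level_db
    mel = mel_filterbank(sr, fft_size, num_mels, fmin, fmax) @ mag
    S = 20 * np.log10(np.maximum(1e-5, mel)) - ref_level_db
    # Normalize to [0, 1]; clipping mirrors allow_clipping_in_normalization=True
    return np.clip((S - min_level_db) / -min_level_db, 0.0, 1.0)
```

The key point of this thread is that the dB conversion and normalization step at the end is exactly what changed between commits, so features extracted with one version will not condition a model trained with the other.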

gudwns1215 commented 6 years ago

hi r9y9, I checked https://github.com/r9y9/wavenet_vocoder/commit/489e6fa92eda9ecf5b953b2783d5975d2fdee27a but I don't see any change to the LJSpeech hparams. The table added in that commit:

| key | value |
|---------------------------------|------------------------------------------------------|
| Data | LJSpeech (12522 for training, 578 for testing) |
| Input type | 16-bit linear PCM |
| Sampling frequency | 22.5kHz |
| Local conditioning | 80-dim mel-spectrogram |
| Hop size | 256 |
| Global conditioning | N/A |
| Total layers | 24 |
| Num cycles | 4 |
| Residual / Gate / Skip-out channels | 512 / 512 / 256 |
| Receptive field (samples / ms) | 505 / 22.9 |
| Number of mixtures | 10 |
| Number of upsampling layers | 4 |

All of these are the same in my hparams. My hparams:

name="wavenet_vocoder",
builder="wavenet",
input_type="raw",
quantize_channels=65536,

sample_rate=22050,

silence_threshold=2,
num_mels=80,
fmin=125,
fmax=7600,
fft_size=1024,

hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

rescaling=True,
rescaling_max=0.999,

allow_clipping_in_normalization=True,

log_scale_min=float(np.log(1e-14)),

out_channels=10 * 3,
layers=24,
stacks=4,
residual_channels=512,
gate_channels=512, 
skip_out_channels=256,
dropout=1 - 0.95,
kernel_size=3,

weight_normalization=True,

cin_channels=80,

upsample_conditional_features=True,

upsample_scales=[4, 4, 4, 4],

freq_axis_kernel_size=3,

gin_channels=-1,  
n_speakers=7,  

pin_memory=True,
num_workers=2,

test_size=0.0441, 
test_num_samples=None,
random_state=1234,

batch_size=2,
adam_beta1=0.9,
adam_beta2=0.999,
adam_eps=1e-8,
initial_learning_rate=1e-3,

lr_schedule="noam_learning_rate_decay",
lr_schedule_kwargs={},  # {"anneal_rate": 0.5, "anneal_interval": 50000},
nepochs=2000,
weight_decay=0.0,
clip_thresh=-1,

max_time_sec=None,
max_time_steps=8000,

exponential_moving_average=True,

ema_decay=0.9999,

checkpoint_interval=10000,
train_eval_interval=10000,

test_eval_epoch_interval=5,
save_optimizer_state=True,
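As a sanity check, the receptive field in the commit's table (505 samples / 22.9 ms) follows directly from layers=24, stacks=4, kernel_size=3 in these hparams. This is the standard dilated-causal-convolution arithmetic, not code from the repo:

```python
# Each dilated conv layer with kernel size k and dilation d
# extends the receptive field by (k - 1) * d samples.
layers, stacks, kernel_size = 24, 4, 3
layers_per_stack = layers // stacks  # 6 layers per cycle
# Dilations cycle 1, 2, 4, 8, 16, 32 within each stack
dilations = [2 ** (i % layers_per_stack) for i in range(layers)]
receptive_field = (kernel_size - 1) * sum(dilations) + 1

sample_rate = 22050
print(receptive_field)                                  # 505 samples
print(round(receptive_field / sample_rate * 1000, 1))   # 22.9 ms
```

So the hparams do match the table; the mismatch the original question hit was in the feature extraction, not the network architecture.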
r9y9 commented 6 years ago

I mean that https://github.com/r9y9/wavenet_vocoder/blob/489e6fa92eda9ecf5b953b2783d5975d2fdee27a/audio.py#L126-L127 was changed to https://github.com/r9y9/wavenet_vocoder/blob/2bf9e78fdee5aef16a63747c82691877fa70c413/audio.py#L127-L129 at some point, which makes a difference.

gudwns1215 commented 6 years ago

aha! thanks! it works!

r9y9 commented 6 years ago

Glad to hear that:)

r9y9 commented 6 years ago

This should be fixed by https://github.com/r9y9/wavenet_vocoder/commit/3718c6d2bc5f691d3f754e7e7393c8a8200a8b97 and https://github.com/r9y9/wavenet_vocoder/commit/e0863c900fc639124221370992d1c7842545667c.

harirawat commented 5 years ago

@gudwns1215 did you generate ljspeech-mel-00001.npy by re-running pre-processing on the LJSpeech dataset?