gudwns1215 closed this issue 6 years ago
I changed the audio feature extraction pipeline a bit since training the model used to generate the sounds at https://r9y9.github.io/wavenet_vocoder/, so you will need to adjust for that. Please check out https://github.com/r9y9/wavenet_vocoder/commit/489e6fa92eda9ecf5b953b2783d5975d2fdee27a and then start from extracting the mel-spectrograms.
Hi r9y9, I checked https://github.com/r9y9/wavenet_vocoder/commit/489e6fa92eda9ecf5b953b2783d5975d2fdee27a, but I don't see any change in the LJSpeech hparams:

| key | value |
|---|---|
| Data | LJSpeech (12522 for training, 578 for testing) |
| Input type | 16-bit linear PCM |
| Sampling frequency | 22.05 kHz |
| Local conditioning | 80-dim mel-spectrogram |
| Hop size | 256 |
| Global conditioning | N/A |
| Total layers | 24 |
| Num cycles | 4 |
| Residual / Gate / Skip-out channels | 512 / 512 / 256 |
| Receptive field (samples / ms) | 505 / 22.9 |
| Number of mixtures | 10 |
| Number of upsampling layers | 4 |

All of these match my hparams. My hparams:
name="wavenet_vocoder",
builder="wavenet",
input_type="raw",
quantize_channels=65536,
sample_rate=22050,
silence_threshold=2,
num_mels=80,
fmin=125,
fmax=7600,
fft_size=1024,
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,
rescaling=True,
rescaling_max=0.999,
allow_clipping_in_normalization=True,
log_scale_min=float(np.log(1e-14)),
out_channels=10 * 3,
layers=24,
stacks=4,
residual_channels=512,
gate_channels=512,
skip_out_channels=256,
dropout=1 - 0.95,
kernel_size=3,
weight_normalization=True,
cin_channels=80,
upsample_conditional_features=True,
upsample_scales=[4, 4, 4, 4],
freq_axis_kernel_size=3,
gin_channels=-1,
n_speakers=7,
pin_memory=True,
num_workers=2,
test_size=0.0441,
test_num_samples=None,
random_state=1234,
batch_size=2,
adam_beta1=0.9,
adam_beta2=0.999,
adam_eps=1e-8,
initial_learning_rate=1e-3,
lr_schedule="noam_learning_rate_decay",
lr_schedule_kwargs={}, # {"anneal_rate": 0.5, "anneal_interval": 50000},
nepochs=2000,
weight_decay=0.0,
clip_thresh=-1,
max_time_sec=None,
max_time_steps=8000,
exponential_moving_average=True,
ema_decay=0.9999,
checkpoint_interval=10000,
train_eval_interval=10000,
test_eval_epoch_interval=5,
save_optimizer_state=True,
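A quick way to sanity-check whether saved features actually match what the model expects is to inspect their shape and value range. A minimal sketch, assuming the features are saved as a (T, 80) float array normalized to [0, 1] (the filename is the one used later in this thread):

```python
import numpy as np

# Conditioning feature file referenced elsewhere in this thread
mel = np.load("LJSpeech-1.1/data/ljspeech-mel-00001.npy")

print(mel.shape)             # expected roughly (T, 80) for 80-dim mels
print(mel.min(), mel.max())  # normalized features should lie in [0, 1]
```

If the range differs noticeably from what the model was trained on, the features were likely produced by a different version of the extraction pipeline, even when the hparams themselves are identical.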
I mean that https://github.com/r9y9/wavenet_vocoder/blob/489e6fa92eda9ecf5b953b2783d5975d2fdee27a/audio.py#L126-L127 was changed to https://github.com/r9y9/wavenet_vocoder/blob/2bf9e78fdee5aef16a63747c82691877fa70c413/audio.py#L127-L129 at some point, which makes a difference.
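For readers who cannot easily diff the two commits: the difference is in how `melspectrogram()` references the dB-scaled mel before normalizing. Below is a minimal sketch of the extraction implied by the hparams above, assuming librosa in place of the repo's actual STFT backend; it is an approximation for illustration, not the repo's exact `audio.py`:

```python
import librosa
import numpy as np

# hparams from the thread above
sample_rate, fft_size, hop_size = 22050, 1024, 256
num_mels, fmin, fmax = 80, 125, 7600
min_level_db, ref_level_db = -100, 20

def melspectrogram(y):
    """Approximate mel extraction; illustrative only, not the repo's code."""
    D = librosa.stft(y, n_fft=fft_size, hop_length=hop_size)
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=fft_size,
                                    n_mels=num_mels, fmin=fmin, fmax=fmax)
    S = np.dot(mel_basis, np.abs(D))
    # amplitude -> dB; the referencing against ref_level_db here is the
    # kind of change the linked commits made
    S_db = 20 * np.log10(np.maximum(1e-5, S)) - ref_level_db
    # normalize to [0, 1] against min_level_db
    return np.clip((S_db - min_level_db) / -min_level_db, 0, 1)
```

A referencing change like this shifts every saved feature value, which is why features extracted at one commit can be incompatible with a model trained at another even though the hparams are unchanged.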
aha! thanks! it works!
Glad to hear that:)
@gudwns1215 did you generate ljspeech-mel-00001.npy by pre-processing the LJSpeech dataset again?
Hi r9y9, thank you for sharing this wonderful project. I downloaded your pre-trained model and tried to synthesize with:

`python synthesis.py checkpoint/lj_check.pth generated/test_awb --conditional=./LJSpeech-1.1/data/ljspeech-mel-00001.npy`

but the result is not good (lj_check.wav.zip). How do I get the same voice as the samples at https://r9y9.github.io/wavenet_vocoder/?
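The likely cause is the same pipeline mismatch discussed above: the ljspeech-mel-00001.npy being passed in may have been extracted with a different audio.py than the one the checkpoint was trained with. A minimal sketch of regenerating the conditioning feature, reusing the illustrative `melspectrogram()` from the sketch above (the paths are the ones from this thread, the (T, 80) layout is an assumption, and in practice you would re-run the repo's own preprocessing as r9y9 suggests rather than this approximation):

```python
import numpy as np
import librosa

# Re-extract the conditioning mel with the current pipeline instead of
# reusing a .npy produced by an older audio.py.
# Reuses melspectrogram() from the sketch earlier in this thread.
wav, _ = librosa.load("LJSpeech-1.1/wavs/LJ001-0001.wav", sr=22050)
mel = melspectrogram(wav).T  # transpose to (T, 80); layout assumed
np.save("LJSpeech-1.1/data/ljspeech-mel-00001.npy", mel)

# Then synthesize with the command from this thread:
#   python synthesis.py checkpoint/lj_check.pth generated/test_awb \
#       --conditional=./LJSpeech-1.1/data/ljspeech-mel-00001.npy
```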