What model are you looking to train?
To save time, here is the full step configuration for the AR model:
```yaml
steps:
  gpt_train:
    training: gpt
    loss_log_buffer: 500
    # Generally follows the recipe from the DALLE paper.
    optimizer: adamw_zero
    optimizer_params:
      lr: !!float 1e-4
      weight_decay: !!float 1e-2
      beta1: 0.9
      beta2: 0.96
    clip_grad_eps: 4
    injectors:
      paired_to_mel:
        type: torch_mel_spectrogram
        mel_norm_file: ../experiments/clips_mel_norms.pth
        in: wav
        out: paired_mel
      paired_cond_to_mel:
        type: for_each
        subtype: torch_mel_spectrogram
        mel_norm_file: ../experiments/clips_mel_norms.pth
        in: conditioning
        out: paired_conditioning_mel
      to_codes:
        type: discrete_token
        in: paired_mel
        out: paired_mel_codes
      paired_fwd_text:
        type: generator
        generator: gpt
        in: [paired_conditioning_mel, padded_text, text_lengths, paired_mel_codes, wav_lengths]
        out: [loss_text_ce, loss_mel_ce, logits]
    losses:
      text_ce:
        type: direct
        weight: .01
        key: loss_text_ce
      mel_ce:
        type: direct
        weight: 1
        key: loss_mel_ce
```
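To make the `losses` section easier to read: `direct` losses pull a named tensor out of the step's output state and scale it by `weight`, so the AR objective is just a weighted sum of the two cross-entropy terms the gpt generator emits. A minimal PyTorch sketch of that sum (not DLAS code; the scalar values are placeholders):

```python
import torch

# Weighted sum matching text_ce (weight .01) and mel_ce (weight 1) above.
def combine_ar_losses(loss_text_ce: torch.Tensor,
                      loss_mel_ce: torch.Tensor) -> torch.Tensor:
    return 0.01 * loss_text_ce + 1.0 * loss_mel_ce

# Placeholder scalars standing in for the CE terms the generator returns:
total = combine_ar_losses(torch.tensor(2.3, requires_grad=True),
                          torch.tensor(1.7, requires_grad=True))
total.backward()
```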
And here it is for the diffusion model:
```yaml
steps:
  generator:
    training: generator
    loss_log_buffer: 2000
    step_outputs: [loss]
    optimizer: adamw
    optimizer_params:
      lr: !!float 1e-4
      weight_decay: 0.001
      beta1: 0.9
      beta2: 0.999
    clip_grad_eps: 1.0
    injectors:
      to_mel:
        type: torch_mel_spectrogram
        mel_norm_file: ../experiments/clips_mel_norms.pth
        in: wav
        out: mel
      resample_wav:
        type: audio_resample
        in: wav
        out: wav_for_vocoder
        input_sample_rate: 22050
        output_sample_rate: 24000
      tacotron_mel:
        type: mel_spectrogram
        mel_fmax: 12000
        sampling_rate: 24000
        n_mel_channels: 100
        # Only normalize the MEL target, because the diffuser specifically cares about it.
        do_normalization: true
        in: wav_for_vocoder
        out: target_mel
      resample_cond:
        type: for_each
        subtype: audio_resample
        input_sample_rate: 22050
        output_sample_rate: 24000
        in: conditioning
        out: conditioning_for_vocoder
      cond_to_mel:
        type: for_each
        subtype: mel_spectrogram
        mel_fmax: 12000
        sampling_rate: 24000
        n_mel_channels: 100
        in: conditioning_for_vocoder
        out: cond_mel
      produce_latents:
        type: gpt_voice_latent
        gpt_path: ../experiments/finetune_gpt_unified_large_kennedy/models/800_gpt_ema.pth
        in: wav
        conditioning_clip: conditioning
        text: padded_text
        text_lengths: text_lengths
        input_lengths: wav_lengths
        out: gpt_latent
      diffusion:
        type: gaussian_diffusion
        in: target_mel
        generator: generator
        beta_schedule:
          schedule_name: linear
          num_diffusion_timesteps: 4000
        diffusion_args:
          model_mean_type: epsilon
          model_var_type: learned_range
          loss_type: mse
        sampler_type: uniform
        model_input_keys:
          aligned_conditioning: gpt_latent
          conditioning_input: cond_mel
        return_code_pred: true
        extra_model_output_keys: [mel_pred]
        out: loss
        out_key_vb_loss: vb_loss
        out_key_x_start: x_start_pred
    losses:
      diffusion_loss:
        after: 500
        type: direct
        weight: 1
        key: loss
      var_loss:
        after: 500
        type: direct
        weight: 1
        key: vb_loss
      mel_surrogate:
        type: pix
        weight: 1
        criterion: l2
        real: target_mel
        fake: mel_pred
```
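The diffusion `losses` read the same way: an epsilon-prediction MSE and a learned-range variational-bound term, both switched on by `after: 500`, plus an L2 `pix` surrogate between the predicted and target mels that runs from the start. A rough sketch of the combined objective under those assumptions (not DLAS code; the exact `after` semantics and the plain summation are guesses):

```python
import torch
import torch.nn.functional as F

def diffusion_step_loss(step: int,
                        eps_pred: torch.Tensor, eps_target: torch.Tensor,
                        vb_loss: torch.Tensor,
                        mel_pred: torch.Tensor, target_mel: torch.Tensor) -> torch.Tensor:
    total = F.mse_loss(mel_pred, target_mel)              # mel_surrogate (pix, l2)
    if step > 500:                                        # `after: 500` gating (assumed exclusive)
        total = total + F.mse_loss(eps_pred, eps_target)  # diffusion_loss (epsilon MSE)
        total = total + vb_loss                           # var_loss (learned_range VB term)
    return total
```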
Note that I have not provided (and will not provide) some of the things you would need to make these step configs work, notably the DVAE. There are also some "weird" things, like transitioning from 22.05kHz audio to 24kHz audio, which were driven by my late-stage decision to use a vocoder.
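For reference, the 22.05kHz to 24kHz hop that the `audio_resample` injectors perform corresponds to a plain torchaudio resample (a sketch of the operation, not the actual injector code):

```python
import torch
import torchaudio

resampler = torchaudio.transforms.Resample(orig_freq=22050, new_freq=24000)
wav = torch.randn(1, 22050)        # one second of placeholder 22.05kHz audio
wav_for_vocoder = resampler(wav)   # shape (1, 24000): one second at 24kHz
```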
I strongly recommend you do not actually attempt to train your models with DLAS. It is a very rough sandbox that I built and maintain for my personal use, and it is not going to be fun for someone else to get working. I would highly recommend doing your training in something better supported, like PyTorch Lightning or FairScale. Hopefully the above configs help you decipher what the pipeline and loss structure look like.
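As a decoding aid: the core pattern in both configs is a chain of injectors, each reading its `in:` keys from a shared state dict and writing its `out:` keys back, so the steps compose purely by key name. A toy illustration of that pattern (not DLAS code):

```python
from typing import Callable, Dict, List, Tuple
import torch

State = Dict[str, torch.Tensor]
Injector = Tuple[Callable[[torch.Tensor], torch.Tensor], str, str]  # (fn, in_key, out_key)

def run_injectors(state: State, injectors: List[Injector]) -> State:
    for fn, in_key, out_key in injectors:
        state[out_key] = fn(state[in_key])
    return state

# e.g. a stand-in "to_mel" maps state["wav"] -> state["mel"]:
state = {"wav": torch.randn(1, 22050)}
state = run_injectors(state, [(lambda w: w.abs(), "wav", "mel")])  # placeholder transform
```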
wow thanks!
@neonbjb Hi, would you mind sharing the config file for the CLVP model? Thanks a lot.
Hello! I found the GptVoiceLatentInjector. It looks like it's meant for inference, although it wouldn't be difficult to make it return losses. Still, I wanted to clarify which injector you used for training, because I also found the GptTtsDataset dataset, which returns quantized_mels (as I understand it, mels pre-processed by the dVAE), whereas GptVoiceLatentInjector takes wav as input.