mustass / diffusion_models_for_speech

Deep Learning course project repository.
https://kurser.dtu.dk/course/02456

Start evaluating the model #9

Closed mustass closed 1 year ago

mustass commented 1 year ago

We have a model trained, what now?

mustass commented 1 year ago

If we are to adopt the inference script from our inspiration repo, then we need to do the following:

  1. Create a dataset & dataloader for predictions, i.e. something that returns the equivalent of:

    import numpy as np

    training_noise_schedule = np.array(model.params.noise_schedule)
    inference_noise_schedule = np.array(model.params.inference_noise_schedule) if fast_sampling else training_noise_schedule

    # Cumulative products of alphas for the full training schedule...
    talpha = 1 - training_noise_schedule
    talpha_cum = np.cumprod(talpha)

    # ...and for the (shorter) inference schedule.
    beta = inference_noise_schedule
    alpha = 1 - beta
    alpha_cum = np.cumprod(alpha)

    # For each inference step s, find the fractional training step t whose
    # cumulative alpha matches alpha_cum[s], interpolating linearly between
    # neighbouring training steps.
    T = []
    for s in range(len(inference_noise_schedule)):
      for t in range(len(training_noise_schedule) - 1):
        if talpha_cum[t+1] <= alpha_cum[s] <= talpha_cum[t]:
          twiddle = (talpha_cum[t]**0.5 - alpha_cum[s]**0.5) / (talpha_cum[t]**0.5 - talpha_cum[t+1]**0.5)
          T.append(t + twiddle)
          break
    T = np.array(T, dtype=np.float32)

This last T is what will become an element of a batch (a runnable sketch of such a dataset follows after this list).

  2. Our Lightning module needs to implement a predict_step that returns something equivalent to the snippet below (a condensed, self-contained sketch follows after this list):

    if not model.params.unconditional:
      # Expand rank-2 tensors by adding a batch dimension.
      if len(spectrogram.shape) == 2:
        spectrogram = spectrogram.unsqueeze(0)
      spectrogram = spectrogram.to(device)
      # One hop's worth of audio samples per spectrogram frame.
      audio = torch.randn(spectrogram.shape[0], model.params.hop_samples * spectrogram.shape[-1], device=device)
    else:
      audio = torch.randn(1, model.params.audio_len, device=device)
    noise_scale = torch.from_numpy(alpha_cum**0.5).float().unsqueeze(1).to(device)  # unused below; kept from the inspiration repo

    # Reverse diffusion: denoise from step n = N-1 down to 0.
    for n in range(len(alpha) - 1, -1, -1):
      c1 = 1 / alpha[n]**0.5
      c2 = beta[n] / (1 - alpha_cum[n])**0.5
      # Subtract the model's noise estimate to get the posterior mean.
      audio = c1 * (audio - c2 * model(audio, torch.tensor([T[n]], device=audio.device), spectrogram).squeeze(1))
      if n > 0:
        # Re-inject scaled Gaussian noise for every step but the last.
        noise = torch.randn_like(audio)
        sigma = ((1.0 - alpha_cum[n-1]) / (1.0 - alpha_cum[n]) * beta[n])**0.5
        audio += sigma * noise
      audio = torch.clamp(audio, -1.0, 1.0)
    return audio, model.params.sample_rate
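
For item 1, here is a minimal sketch of such a prediction dataset. The file layout (mel spectrograms stored as .spec.npy files), the class name PredictionDataset, and the helper compute_inference_timesteps are illustrative assumptions, not settled conventions in our repo:

    from pathlib import Path

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    def compute_inference_timesteps(training_schedule, inference_schedule):
      # Same mapping as above: place each inference step at the fractional
      # training step with a matching cumulative alpha.
      talpha_cum = np.cumprod(1 - np.asarray(training_schedule))
      alpha_cum = np.cumprod(1 - np.asarray(inference_schedule))
      T = []
      for s in range(len(inference_schedule)):
        for t in range(len(training_schedule) - 1):
          if talpha_cum[t+1] <= alpha_cum[s] <= talpha_cum[t]:
            twiddle = (talpha_cum[t]**0.5 - alpha_cum[s]**0.5) / (talpha_cum[t]**0.5 - talpha_cum[t+1]**0.5)
            T.append(t + twiddle)
            break
      return np.array(T, dtype=np.float32)

    class PredictionDataset(Dataset):
      # Each item carries a conditioning spectrogram plus the (shared)
      # timesteps T, so a batch contains everything the sampler needs.
      def __init__(self, spec_dir, training_schedule, inference_schedule):
        self.paths = sorted(Path(spec_dir).glob('*.spec.npy'))
        self.T = compute_inference_timesteps(training_schedule, inference_schedule)

      def __len__(self):
        return len(self.paths)

      def __getitem__(self, idx):
        spectrogram = torch.from_numpy(np.load(self.paths[idx]))
        return {'spectrogram': spectrogram, 'T': torch.from_numpy(self.T)}

Since spectrograms generally differ in length, the default collate only works with batch_size=1; larger batches would need a padding collate_fn.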

Lastly, we need to make it work effortlessly given a pre-trained model. This is quite easy.
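
Putting item 2 and the checkpoint wiring together, a condensed, hypothetical sketch of the Lightning side could look as follows. The class name DiffusionLit, the attributes self.model / self.params, and the assumption that hyperparameters were stored via save_hyperparameters() are all ours to settle; the loop body is the reverse-diffusion snippet from above:

    import numpy as np
    import pytorch_lightning as pl
    import torch

    class DiffusionLit(pl.LightningModule):
      # Assumed wrapper: self.model is the diffusion network,
      # self.params its configuration object.
      def predict_step(self, batch, batch_idx):
        beta = np.array(self.params.inference_noise_schedule)
        alpha = 1 - beta
        alpha_cum = np.cumprod(alpha)
        T = batch['T'][0]  # identical for every element of the batch

        spectrogram = batch['spectrogram'].to(self.device)
        audio = torch.randn(spectrogram.shape[0], self.params.hop_samples * spectrogram.shape[-1], device=self.device)
        # Reverse diffusion loop, exactly as in the snippet above.
        for n in range(len(alpha) - 1, -1, -1):
          c1 = 1 / alpha[n]**0.5
          c2 = beta[n] / (1 - alpha_cum[n])**0.5
          eps = self.model(audio, T[n].view(1).to(self.device), spectrogram).squeeze(1)
          audio = c1 * (audio - c2 * eps)
          if n > 0:
            sigma = ((1.0 - alpha_cum[n-1]) / (1.0 - alpha_cum[n]) * beta[n])**0.5
            audio += sigma * torch.randn_like(audio)
          audio = torch.clamp(audio, -1.0, 1.0)
        return audio

Given a trained checkpoint, synthesis then reduces to a few lines (the path and loader are placeholders):

    model = DiffusionLit.load_from_checkpoint('checkpoints/last.ckpt')
    trainer = pl.Trainer(accelerator='auto', devices=1)
    audio_batches = trainer.predict(model, dataloaders=predict_loader)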

mustass commented 1 year ago

#22 implemented synthesis, but not during training. This could be moved to another issue.