seastar105 / pflow-encodec

Implementation of a TTS model based on the NVIDIA P-Flow TTS paper

How to generate output variation? #5

Closed by chunping-xt 2 months ago

chunping-xt commented 2 months ago

Thanks for your very carefully written code. I can train the model smoothly and efficiently, with no problems during training. I have a few questions:

  1. How can the model generate 44 kHz-quality audio? Or do you plan to test other vocoders with a higher sampling rate than the current one?
  2. When I run inference with the same text and prompt, it seems to generate the same result. Is there a parameter I can change to generate variations, similar to the temperature of some transformer models? I expect style variation every time I generate.
  3. When I trained with singing data, I noticed that the output had fairly even word durations, while singing data often has words with long durations. Does the text2latent_ratio parameter affect word length?
seastar105 commented 2 months ago

@chunping-xt

  1. This repo is mainly built around Encodec, which generates 24 kHz audio. So if you want to generate 44.1 kHz audio, you need to swap in a VAE/codec compatible with 44.1 kHz (DAC, maybe?) or use a mel spectrogram with a 44.1 kHz-compatible vocoder.
  2. Since the model is a diffusion model and generation starts from noise, it will generate different speech every time unless you fix the seed and generator (see the first sketch below). It may not vary as much as you expect, though.
  3. text2latent_ratio evenly expands the predicted durations from the duration frame rate (50 Hz) to the output latent frame rate (75 Hz), so it does not affect word length (see the second sketch below). I also think GT durations are not that suitable for singing data, since the model was trained only on speech data. It may be better to try MAS or an AR model to handle dynamic data like a singing voice.
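To make point 2 concrete, here is a minimal sketch of seed-controlled variation for a sampler that starts from Gaussian noise. The `model.sample(...)` call, its `initial_noise` argument, and the tensor shapes are hypothetical stand-ins, not this repo's actual API:

```python
import torch

def synthesize(model, text, prompt, seed=None):
    # Fixing the seed fixes the starting noise, and hence the output;
    # omitting or varying it gives a different realization on each call.
    gen = torch.Generator()
    if seed is not None:
        gen.manual_seed(seed)
    # (batch, latent_frames, latent_dim) -- shape is illustrative only.
    noise = torch.randn(1, 300, 128, generator=gen)
    return model.sample(text, prompt, initial_noise=noise)  # hypothetical API

# Same seed -> identical output; different seeds -> (possibly subtle) variation:
# out_a = synthesize(model, "hello world", prompt, seed=0)
# out_b = synthesize(model, "hello world", prompt, seed=1)
```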
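And a toy illustration of point 3, with made-up numbers: the ratio rescales every predicted duration by the same 75/50 factor, so relative word lengths are preserved and no word can become disproportionately long:

```python
import torch

# Per-token durations predicted at the text frame rate (50 Hz) -- made-up values.
durations_50hz = torch.tensor([4, 8, 2])
text2latent_ratio = 75 / 50  # latent frame rate / duration frame rate

# Uniform expansion to the latent frame rate (75 Hz): every duration is
# stretched by the same factor, so the output rhythm is unchanged.
durations_75hz = (durations_50hz * text2latent_ratio).long()
print(durations_75hz)  # tensor([ 6, 12,  3])
```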
Tera2Space commented 2 weeks ago

We can use any sample rate, because we only use the encoder and decoder from Encodec. (It works for me.)

```python
from audiolm_pytorch import EncodecWrapper
import torchaudio as ta
from einops import rearrange
import IPython.display as ipd

encodec = EncodecWrapper()

# Load the audio at its native sample rate (44.1 kHz here).
audio, sr = ta.load("")  # path to your 44.1 kHz audio file
print(sr)  # sr == 44100

# Add the channel axis Encodec expects, then run the raw (pre-quantization) encoder.
codecs = encodec.model.encoder(rearrange(audio, f'b t -> b {encodec.model.channels} t'))

# Decode the continuous latents back to a waveform.
audio_gen = encodec.decode(codecs.transpose(1, 2))
ipd.display(ipd.Audio(audio_gen.detach().cpu().numpy()[0][0], rate=44100))
```
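One caveat worth adding: this round-trips because Encodec's encoder and decoder are fully convolutional and so accept audio at any sample rate, but the default checkpoint EncodecWrapper loads was trained on 24 kHz audio. Feeding it 44.1 kHz input is outside the training distribution, so the reconstruction quality at other rates is worth verifying by ear.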