p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

descript/encodec is too slow in dataloader #14

Open vuong-ts opened 7 months ago

vuong-ts commented 7 months ago

Hi @p0p4k ,

I see that the DAC encoding in the dev/descript_codec branch is too slow on CPU in the DataLoader. How can we speed up this process?

def batched_encodec(self, wav):
    with torch.no_grad():
        self.encodec.eval()
        wav = self.resampler(wav)  # resample to 24 kHz
        signal = AudioSignal(wav, 24000)
        x = self.encodec.preprocess(signal.audio_data, signal.sample_rate)
        # DAC's encode returns (z, codes, latents, commitment_loss, codebook_loss)
        _, _, latents, _, _ = self.encodec.encode(x)
    return latents
p0p4k commented 7 months ago

Yes, it is supposed to be done in the collate function or inside the model itself, so we can take advantage of batching. In this implementation, I think I am doing it one item at a time (?)
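
A rough sketch of the collate-function version (hypothetical code; the encodec/resampler objects and the (text, wav) item layout are placeholders, not necessarily what this repo uses):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch, encodec, resampler):
    # batch: list of (text, wav) pairs, wav being a 1-D float tensor
    texts = [item[0] for item in batch]
    wavs = [resampler(item[1]) for item in batch]          # resample each clip to 24 kHz
    wav_lengths = torch.tensor([w.shape[-1] for w in wavs])
    wavs = pad_sequence(wavs, batch_first=True)            # (B, T), zero-padded
    with torch.no_grad():
        encodec.eval()
        x = encodec.preprocess(wavs.unsqueeze(1), 24000)   # (B, 1, T)
        _, _, latents, _, _ = encodec.encode(x)            # one encode call per batch
    return texts, latents, wav_lengths

This only amortizes the per-call overhead, though; on CPU the encoder forward pass itself is still the bottleneck.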

vuong-ts commented 7 months ago

Running self.encodec on the GPU can speed up the process, but I got a CUDA error.

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
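
For reference, the generic workaround for that specific error is to have the DataLoader spawn its workers instead of forking them (a sketch, assuming the GPU encoder really has to stay inside the workers; dataset, batch_size and num_workers are placeholders):

import torch
from torch.utils.data import DataLoader

# Spawned workers start from a fresh interpreter, so CUDA can be initialized in them.
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4,
    multiprocessing_context="spawn",
)

Each spawned worker then holds its own CUDA context (and GPU memory), so this is not free either.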
p0p4k commented 7 months ago

You can load a DAC encoder on each of your GPUs, then send the data to the corresponding GPU and compute the latents there. Another temporary option is to precompute the latents for all your audio files during preprocessing, save them to disk, and then just load them during training.
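
A minimal sketch of the second option (assuming the packaged DAC loader from the descript-audio-codec README and mono audio; paths and the helper name are hypothetical):

import os
import torch
import torchaudio
import dac
from audiotools import AudioSignal

device = "cuda" if torch.cuda.is_available() else "cpu"
model = dac.DAC.load(dac.utils.download(model_type="24khz")).to(device).eval()

@torch.no_grad()
def cache_latents(wav_path, save_dir):
    wav, sr = torchaudio.load(wav_path)
    signal = AudioSignal(wav, sr).resample(24000)            # match the 24 kHz codec
    x = model.preprocess(signal.audio_data.to(device), 24000)
    _, _, latents, _, _ = model.encode(x)
    out_path = os.path.join(save_dir, os.path.basename(wav_path) + ".pt")
    torch.save(latents.cpu(), out_path)                      # Dataset later just torch.load()s this

The Dataset's __getitem__ then only has to load the cached tensor, so the DataLoader stays cheap.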

vuong-ts commented 7 months ago

So, I loaded the DAC model into pflowTTS, but I haven't had success in training the entire pflowTTS with DAC.

vuong-ts commented 7 months ago

So, I managed to train pflow with the DAC codec.

I use the following code to decode audio with DAC.

dataset: ljspeech
epoch: 200
text: On one occasion a disturbance was raised which was not quelled until windows had been broken and forms and tables burnt.
# Encode a prompt waveform to DAC latents
with torch.no_grad():
    dac_encodec.eval()
    wav = resampler(wav)  # resample to 24 kHz
    signal = AudioSignal(wav, 24000)
    x = dac_encodec.preprocess(signal.audio_data, signal.sample_rate).to(device)
    _, _, latents, _, _ = dac_encodec.encode(x)

# Synthesize latents from text, conditioned on the prompt latents
output = synthesise(text, latents)
pred_latents = output["decoder_outputs"]
pred_latents = pred_latents.reshape(1, 32, 8, -1)  # B x N x D x T (32 quantizers, 8-dim latents each)
pred_latents = pred_latents.to(device)

# Re-quantize the predicted latents and sum the per-codebook projections
z_q = 0
for i in range(32):
    z_q_i, indices = dac_encodec.quantizer.quantizers[i].decode_latents(pred_latents[:, i, :, :])
    z_q_i = dac_encodec.quantizer.quantizers[i].out_proj(z_q_i)
    z_q += z_q_i

# Decode z_q back to a waveform and play it
idp.Audio(dac_encodec.decode(z_q).squeeze(dim=0).detach().cpu().numpy(), rate=24_000)

The audio output is not as good as with the mel-spectrogram representation. https://drive.google.com/file/d/18hDs-mL8mqwmuVTQd8ZMfFsfWFsfxMW9/view

Can I ask for your comments on this, @p0p4k?

vuong-ts commented 7 months ago

@rafaelvalle Regarding the neural codec representation, can I ask you a couple of questions, since you are one of the authors? :blush:

  1. Have you tried training P-Flow with audio codec codes (as in VALL-E) instead of mel-spectrograms?
  2. Is training on the neural codec representation significantly slower than on mel-spectrograms?
p0p4k commented 7 months ago

@vuong-ts The latest Meta paper, Audiobox, uses OT-CFM on Encodec latents. The twist is pre-training the TTS model with lots of Encodec data (~60k). P-Flow TTS without the loss mask, MAS, and text conditioning is almost equivalent to that pre-training stage; then we can fine-tune on text conditioning. That might be the solution. Essentially: take a masked wav -> masked latent -> train OT-CFM to predict the entire latent (like BERT), and then adapt it downstream for TTS.
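
Not from any paper, just a rough sketch of that masked-latent OT-CFM objective to make the idea concrete (velocity_net, the mask ratio and sigma_min are all placeholders):

import torch
import torch.nn.functional as F

def masked_otcfm_loss(velocity_net, latents, mask_ratio=0.7, sigma_min=1e-4):
    # latents: (B, D, T) codec latents; this is the clean target x1.
    B, D, T = latents.shape
    # Hide a random subset of frames from the conditioning signal (BERT-style).
    frame_mask = (torch.rand(B, 1, T, device=latents.device) < mask_ratio).float()
    cond = latents * (1.0 - frame_mask)

    x1 = latents
    x0 = torch.randn_like(x1)                          # noise sample
    t = torch.rand(B, 1, 1, device=latents.device)     # flow-matching time

    # Standard OT-CFM path and target velocity
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0

    v_pred = velocity_net(x_t, t.view(B), cond)        # predict velocity for the *entire* latent
    return F.mse_loss(v_pred, u_t)

Fine-tuning for TTS would then swap the masked-latent conditioning for (or combine it with) text conditioning.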