lucidrains / naturalspeech2-pytorch

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
MIT License

audio2audio? #6

Closed by turian 1 year ago

turian commented 1 year ago

I'm curious how difficult it would be to get this model to support audio2audio training.

For example, the input is noisy speech and the output is denoised speech.

This basically assumes that we would have a finetuning step with (input, output) pairs.

lucidrains commented 1 year ago

@turian ohh that is straightforward; in image to image ddpm, we just concat the input image along the channel dimension at the very beginning. there are other schemes where they can be separately encoded and meet at some bottleneck
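A minimal sketch of the channel-concatenation scheme mentioned above, applied to 1D audio rather than images. The `ConditionedDenoiser` class, its layer sizes, and the fixed noise scale are hypothetical and chosen only for illustration; a real diffusion network would also take the timestep as input, and this is not code from this repository.

```python
import torch
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    """Toy 1D network (hypothetical) standing in for a diffusion denoiser."""

    def __init__(self, channels: int = 1, hidden: int = 32):
        super().__init__()
        # The first conv sees 2 * channels: the noised target plus the
        # conditioning audio, stacked along the channel dimension.
        self.net = nn.Sequential(
            nn.Conv1d(channels * 2, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, noised_target: torch.Tensor, cond_audio: torch.Tensor) -> torch.Tensor:
        # Both tensors are (batch, channels, time) and must be the same length.
        assert noised_target.shape == cond_audio.shape
        x = torch.cat((noised_target, cond_audio), dim=1)
        return self.net(x)


model = ConditionedDenoiser()
clean = torch.randn(4, 1, 1024)        # desired output (e.g. denoised speech)
noisy_input = torch.randn(4, 1, 1024)  # conditioning input (e.g. noisy speech)

# Stand-in for the forward diffusion step; a real noise schedule is omitted.
noised_target = clean + 0.5 * torch.randn_like(clean)

pred = model(noised_target, noisy_input)
```

The output keeps the shape of the target, so the usual diffusion loss can be applied unchanged; only the first layer's input width differs from the unconditioned model.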

if the input and output audio were different lengths, then i think you'd need to train a length predictor and sample from it

lucidrains commented 1 year ago

@turian what is your take on the results from this paper?

lucidrains commented 1 year ago

@turian for your example, i don't think you would need to do it in latent space. in fact, you would probably just use a GAN with an all-convolutional network

turian commented 1 year ago

@lucidrains Haven't gotten to study it yet. I'm very interested in diffusion for audio2audio when the audios are the same length, using the double channel strategy you proposed. Haven't had much success yet, because the training times from scratch are SO LONG.

Then the issue with using a pretrained model is that when you double the number of channels, you probably also want to increase the model capacity in other ways. There are some old black magic hacks for doing this sort of neural network brain transplant from one model to another. Anyway, I might be implementing them.
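One common way to handle the doubled input channels on a pretrained model can be sketched as follows: widen the first convolution, copy the pretrained weights into the original channels, and zero-initialise the new ones so the widened layer initially reproduces the pretrained output. The helper name `widen_input_conv` is hypothetical, the sketch assumes an ungrouped `nn.Conv1d` input layer, and it does not address increasing capacity elsewhere in the model; it is not necessarily the set of techniques referred to above.

```python
import torch
import torch.nn as nn


def widen_input_conv(conv: nn.Conv1d, extra_in_channels: int) -> nn.Conv1d:
    """Return a copy of `conv` that accepts `extra_in_channels` more inputs.

    Hypothetical helper: pretrained weights are kept for the original
    channels and the new channels start at zero.
    """
    assert conv.groups == 1, "sketch only covers ungrouped convolutions"
    new_conv = nn.Conv1d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        dilation=conv.dilation,
        bias=conv.bias is not None,
        padding_mode=conv.padding_mode,
    )
    with torch.no_grad():
        # Zero everything, then restore the pretrained slice.
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


pretrained = nn.Conv1d(1, 32, kernel_size=3, padding=1)  # stand-in for a pretrained layer
widened = widen_input_conv(pretrained, extra_in_channels=1)

x = torch.randn(2, 1, 256)     # original input
cond = torch.randn(2, 1, 256)  # new conditioning channel

out_old = pretrained(x)
out_new = widened(torch.cat((x, cond), dim=1))
```

Because the new channels start at zero, `out_new` matches `out_old` before any finetuning, so training starts from the pretrained behaviour and learns to use the conditioning channel gradually.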

lucidrains commented 1 year ago

@turian there should be a lot of audio diffusion works out there that support this

you should ask the author of https://github.com/archinetai/audio-diffusion-pytorch to implement this, as it is a couple lines change, if he hasn't already

lucidrains commented 1 year ago

@turian i can add it to this repository too, for audio of same length. i just think for denoising audio it probably is not the best fit

tbright17 commented 1 year ago

A follow-up question: I didn't find any conditioning on text. Where is it?