Closed: turian closed this issue 1 year ago
@turian ohh that is straightforward; in image to image ddpm, we just concat the input image along the channel dimension at the very beginning. there are other schemes where they can be separately encoded and meet at some bottleneck
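that channel-concat conditioning might look like this minimal sketch (the `q_sample_stub` noising and the shapes are illustrative, not a real DDPM schedule):

```python
import torch

def q_sample_stub(x, t):
    # placeholder forward-noising; a real DDPM applies its noise schedule here
    return x + torch.randn_like(x) * 0.1

x_start = torch.randn(2, 3, 64, 64)   # target (output) image
cond    = torch.randn(2, 3, 64, 64)   # conditioning (input) image
t       = torch.randint(0, 1000, (2,))

x_noisy = q_sample_stub(x_start, t)
# concat along the channel dimension -> the unet's first conv simply
# takes 6 input channels instead of 3; nothing else changes
unet_input = torch.cat((x_noisy, cond), dim=1)
assert unet_input.shape == (2, 6, 64, 64)
```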
if the audio lengths were different, then i think you'd need to train a length predictor and sample from it
@turian what is your take on the results from this paper?
@turian for your example, i don't think you would need to do it in latent space. in fact, you probably just use a GAN with an all-convolutional network
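for the all-convolutional idea, something like this sketch (layer sizes are illustrative): because every layer is a same-padded convolution, the output waveform has the same length as the input, so no length prediction is needed.

```python
import torch
from torch import nn

# hedged sketch of a fully-convolutional 1D generator for
# waveform-to-waveform denoising; channel counts are illustrative
generator = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=15, padding=7),
    nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=15, padding=7),
    nn.ReLU(),
    nn.Conv1d(32, 1, kernel_size=15, padding=7),
)

noisy = torch.randn(4, 1, 16000)       # batch of 1-second 16 kHz clips
denoised = generator(noisy)
assert denoised.shape == noisy.shape   # time dimension is preserved
```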
@lucidrains Haven't gotten to study it yet. I'm very interested in diffusion for audio2audio when the audios are the same length, using the double channel strategy you proposed. Haven't had much success yet, because the training times from scratch are SO LONG.
Then the issue with using a pretrained model is that when you double the number of channels, you probably also want to increase the model capacity in other ways. There are some old black magic hacks for doing this sort of neural network brain transplant from one model to another. Anyway, I might be implementing them.
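One common version of that transplant, sketched under the channel-concat setup: widen the pretrained first conv, copy the old weights into the first slots, and zero-init the new input channels so the expanded model initially behaves exactly like the pretrained one and only gradually learns to use the condition.

```python
import torch
from torch import nn

# `old_conv` stands in for the pretrained model's first convolution
old_conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
new_conv = nn.Conv2d(6, 64, kernel_size=3, padding=1)  # doubled input channels

with torch.no_grad():
    new_conv.weight.zero_()                   # new channels start at zero
    new_conv.weight[:, :3] = old_conv.weight  # transplant pretrained weights
    new_conv.bias.copy_(old_conv.bias)

x = torch.randn(1, 3, 32, 32)
x6 = torch.cat((x, torch.zeros_like(x)), dim=1)
# with the extra channels zeroed, the widened conv reproduces the original
assert torch.allclose(new_conv(x6), old_conv(x))
```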
@turian there should be a lot of audio diffusion works out there that support this
you should ask the author of https://github.com/archinetai/audio-diffusion-pytorch to implement this, as it is a couple lines change, if he hasn't already
@turian i can add it to this repository too, for audio of same length. i just think for denoising audio it probably is not the best fit
A follow-up question: I didn't find any conditioning on text. Where is it?
I'm curious how difficult it would be to get this model to support audio2audio training.
For example, the input is noisy speech and the output is denoised speech.
This basically assumes that we would have a finetuning step with (input, output) pairs.
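A minimal sketch of what one finetuning step on such pairs could look like, assuming the channel-concat conditioning discussed above (`denoiser` is a stand-in model and the noising here is a toy interpolation, not a real diffusion schedule):

```python
import torch
import torch.nn.functional as F

# stand-in for a conditioned denoiser: 2 input channels = noised target + condition
denoiser = torch.nn.Conv1d(2, 1, kernel_size=3, padding=1)

noisy_speech = torch.randn(8, 1, 16000)   # input:  noisy waveform (condition)
clean_speech = torch.randn(8, 1, 16000)   # target: denoised waveform

t = torch.rand(8, 1, 1)                   # diffusion time in [0, 1]
eps = torch.randn_like(clean_speech)      # noise the model learns to predict
x_t = clean_speech * (1 - t) + eps * t    # toy noising, not a real schedule

pred = denoiser(torch.cat((x_t, noisy_speech), dim=1))
loss = F.mse_loss(pred, eps)              # standard epsilon-prediction objective
loss.backward()
assert denoiser.weight.grad is not None   # gradients flow to the model
```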