teticio / audio-diffusion

Apply diffusion models using the new Hugging Face diffusers package to synthesize music instead of images.
GNU General Public License v3.0

Training for the purpose of Super res + denoising conditioned on artifacted data #52

Open Respaired opened 3 weeks ago

Respaired commented 3 weeks ago

Hey. Appreciate the wonderful work you're doing here. (and thanks for not leaving any issues open!)

I have a somewhat peculiar task I wish to handle, hopefully using the tools you have kindly provided in this repo.

Let's say I have a dataset of clean samples and a low-quality, artifacted version of it (by artifacts I mean those robotic hiccups you usually hear in GAN-based vocoders or in highly compressed audio). I also want the model to handle the upsampling from 24 kHz to 48 kHz.

My idea was to train a diffusion model that takes these artifacted samples as input but, instead of reconstructing them, tries to reconstruct the original clean ground truth. Something like:

loss_fn(x_noisy_with_artifacts, good_ground_truth) instead of the usual diffusion drill, loss_fn(x_noisy_deconstructed_from_good_ground_truth, good_ground_truth)

That way the model could be robust to these issues. To sum it up, it's basically an audio2audio problem; the tasks are super-resolution + enhancement.
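To make that concrete, here is a minimal sketch of the training step I have in mind, assuming a diffusers-style setup where the degraded mel is concatenated to the noisy clean mel as an extra input channel (the model config, shapes, and names are just illustrative, not from this repo):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

# Illustrative config: 2 input channels = noisy clean mel + degraded conditioning mel.
model = UNet2DModel(sample_size=256, in_channels=2, out_channels=1)
scheduler = DDPMScheduler(num_train_timesteps=1000)

def training_step(clean_mel, degraded_mel):
    # Assumes both mels are rendered at the same resolution, [B, 1, H, W].
    noise = torch.randn_like(clean_mel)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (clean_mel.shape[0],), device=clean_mel.device,
    )
    # Noise is added to the CLEAN target only...
    noisy_clean = scheduler.add_noise(clean_mel, noise, timesteps)
    # ...while the artifacted version is passed in unchanged as conditioning.
    model_input = torch.cat([noisy_clean, degraded_mel], dim=1)
    noise_pred = model(model_input, timesteps).sample
    return F.mse_loss(noise_pred, noise)
```

At inference you would start from pure noise in the target channel, keep the degraded mel fixed in the conditioning channel, and denoise as usual.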

I'm concerned that if I train a model in the usual way, it'll try to reconstruct those artifacts as well during inference, which is not what I want here.

Anyway, may I ask what you think is the best way to do this?

By the way, is there any plan to add DiT to this very cool repo of yours?

teticio commented 1 week ago

Thanks!

It's an interesting idea, but given that the model (as it stands) works with mel spectrograms, you will get artefacts from the inversion unless you use a neural vocoder like HiFi-GAN. Personally, I would probably take a different approach and use some kind of codebook to compress the audio, like Meta's https://github.com/facebookresearch/encodec, and train a model end to end to recover the original as best as possible. I think you will need a way to measure "best" that is differentiable and correlates well with human hearing. I imagine there is a fair amount of research on this point.
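To give an idea, something along these lines (adapted from the encodec README; the file path is a placeholder) is what I mean by working with a codebook:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec model; pick a target bandwidth in kbps.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Resample the input to the model's sample rate and channel count.
wav, sr = torchaudio.load("example.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

# Encode to discrete codes: a restoration model could then be trained
# end to end to map codes of the degraded audio to the clean audio.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)  # [B, n_q, T]
```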

If you train the model on the noisy samples, it may well not learn to reproduce the noise as such, but it will introduce its own artefacts.

What is DiT?

Respaired commented 1 week ago

Thank you for the response!

The problem is, I'm trying to fix the artifacts produced by GAN vocoders (HiFi-GAN), lol, so that's an interesting approach. Yeah, people usually use PESQ as the metric, but unfortunately it doesn't support audio above 16 kHz, as far as I know.
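For example, with the pesq package on PyPI the sample rate argument is limited to 8 or 16 kHz (the file names here are placeholders):

```python
from scipy.io import wavfile
from pesq import pesq

# pesq() only accepts fs=8000 ('nb') or fs=16000 ('wb'/'nb'), so
# 24/48 kHz audio would have to be downsampled first, which defeats
# the purpose for super-resolution.
rate, ref = wavfile.read("clean_16k.wav")
rate, deg = wavfile.read("restored_16k.wav")
print(pesq(rate, ref, deg, "wb"))  # wideband PESQ score
```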

How about a simple super-resolution task? That shouldn't need any of those.

DiT is the Diffusion Transformer. There's quite a lot of hype behind it these days.