teticio / audio-diffusion

Apply diffusion models using the new Hugging Face diffusers package to synthesize music instead of images.
GNU General Public License v3.0

[Little Feedback] Thank you! :) #39

FeelTheFonk closed this issue 1 year ago

FeelTheFonk commented 1 year ago

Hello teticio! (and the community :smile:),

I'm a sound and music enthusiast, and I recently discovered your project on GitHub. I wanted to express my immense gratitude for your work and for sharing it. It's been a joy to experiment with the diffusers.

Below are some specific details about my experience with your project:

My main run, 15 epochs on a dataset of roughly 500 tracks, amounted to almost 40 hours of training time. The results were fascinating: while some of the generated sounds were unusable, some grooves (especially at the low end) were impressively clear and precise.

This training configuration led to the best results I've seen among all my trials, even with relatively few epochs and limited computational power. It seems that the key to the best possible outcome lies in ensuring perfect consistency of the dataset and prioritizing quality over quantity.
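In case it helps anyone reproduce this, the dataset preparation essentially boils down to slicing each track into fixed-length segments and rendering every slice as a mel-spectrogram image at the model's input resolution. Here is a minimal librosa sketch of that idea; the sample rate, FFT size, hop length and dB scaling are my own illustrative choices rather than the repo's exact defaults:

```python
import librosa
import numpy as np
from PIL import Image

# Illustrative settings for a 448x448 image covering roughly 4 seconds of audio.
RESOLUTION = 448      # number of mel bins (height) and of time frames (width)
SAMPLE_RATE = 22050
HOP_LENGTH = 195      # 448 frames * 195 samples / 22050 Hz ≈ 4.0 s per image
N_FFT = 2048
TOP_DB = 80

def slice_to_spectrogram_image(path: str, offset: float = 0.0) -> Image.Image:
    """Render one fixed-length slice of a track as a greyscale mel-spectrogram image."""
    duration = RESOLUTION * HOP_LENGTH / SAMPLE_RATE
    y, _ = librosa.load(path, sr=SAMPLE_RATE, offset=offset, duration=duration)
    S = librosa.feature.melspectrogram(
        y=y, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=RESOLUTION
    )
    # Normalise to the slice's peak and map [-TOP_DB, 0] dB onto [0, 255].
    log_S = librosa.power_to_db(S, ref=np.max, top_db=TOP_DB)
    img = (255 * (log_S + TOP_DB) / TOP_DB).clip(0, 255).astype(np.uint8)
    # Crop the extra frame to get a square; slices shorter than `duration`
    # (e.g. at the end of a track) would need padding or dropping.
    return Image.fromarray(img[:, :RESOLUTION])

image = slice_to_spectrogram_image("some_track.wav", offset=30.0)
image.save("slice_000.png")
```

Rendering every slice with identical settings is what keeps the dataset consistent from the model's point of view.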

I'd love to share my audio results here (with your permission, of course). Currently, I'm running a new 20-epoch training primarily on jazz sounds and softer old-school trip-hop. I've drastically reduced the size of the dataset (down to 46 tracks) but fine-tuned the hyperparameters accordingly. I'm hopeful for satisfying results.

To conclude, I believe the creative possibilities with substantial computational power could lead to extraordinary generations. I'm curious whether anyone else has ventured far with that kind of hardware.

Once again, thank you! Your work has led me to explore a whole new world of creativity. I'm excited to see what you'll develop next.

(P.S.: I apologize for any issues with the formatting; I'm still getting used to GitHub, having officially joined today.)

Best, Fonk

teticio commented 1 year ago

Wow, thanks for the feedback! This is exactly what I hoped to facilitate by making it relatively easy to try things out and get early results on modest, consumer-grade hardware. Please feel free to share your results here. Have fun!

FeelTheFonk commented 1 year ago

My pleasure! :)

These are some results from the main training run (15 epochs / 40 hours / 500 tracks) with an input size of 448x448 (DDPM, just looped 4-second segments):

https://github.com/teticio/audio-diffusion/assets/134219563/b6828b9b-4119-46b0-b11b-9ca742c4e239

https://github.com/teticio/audio-diffusion/assets/134219563/b1775567-cecb-4e63-a6d2-8c53a53bdc6b

https://github.com/teticio/audio-diffusion/assets/134219563/36167f80-2fa7-4e9e-a586-a6ef1c1108ea

The first two sounds are good examples of sampling material; the last one comes from the variation generation function with `start_step` set to 650.
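For anyone curious, my understanding of the `start_step` idea is that the source spectrogram is noised up to an intermediate timestep and only the remaining denoising steps are run, so the output keeps more or less of the original depending on where you resume. Below is a rough diffusers sketch of that mechanism, not the repo's exact implementation; the model path, the single-channel 448x448 input and the 1000-step DDPM schedule are assumptions on my part:

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

# Hypothetical checkpoint path: any diffusers-style folder (or Hub ID) with
# unet/ and scheduler/ subfolders, e.g. the output directory of a training run.
MODEL_DIR = "path/to/my-audio-diffusion-model"
unet = UNet2DModel.from_pretrained(MODEL_DIR, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(MODEL_DIR, subfolder="scheduler")
scheduler.set_timesteps(1000)  # assuming the usual 1000-step DDPM schedule

# The source clip's mel spectrogram as a (batch, channels, H, W) tensor in [-1, 1];
# a random placeholder stands in here, and its size must match the model's sample size.
spectrogram = torch.randn(1, 1, 448, 448)

# Skip the first `start_step` (noisiest) timesteps: noise the source to the level
# where denoising resumes, then run only the remaining steps. The more steps are
# skipped, the closer the variation stays to the original.
start_step = 650
timesteps = scheduler.timesteps  # descending: 999, 998, ..., 0
sample = scheduler.add_noise(
    spectrogram, torch.randn_like(spectrogram), timesteps[start_step]
)

with torch.no_grad():
    for t in timesteps[start_step:]:
        noise_pred = unet(sample, t).sample
        sample = scheduler.step(noise_pred, t, sample).prev_sample

# `sample` now holds the variation's spectrogram; convert it back to audio with
# the same mel settings that were used to build the training images.
```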

Please keep in mind that these are initial results; I plan to refine and improve them further as I gain access to more computational resources (and I will soon; I can't wait to work on higher spectrogram resolutions).