Closed FeelTheFonk closed 1 year ago
Wow, thanks for the feedback! This is exactly what I hoped to facilitate, by making it relatively easy to try things out and get early results on modest consumer grade hardware. Please feel free to share your results here. Have fun!
My pleasure! :)
These are some results from the main training run (15 epochs, ~40 hours, 500 tracks) with an input size of 448×448 (DDPM, trained on looped 4-second segments):
https://github.com/teticio/audio-diffusion/assets/134219563/b6828b9b-4119-46b0-b11b-9ca742c4e239
https://github.com/teticio/audio-diffusion/assets/134219563/b1775567-cecb-4e63-a6d2-8c53a53bdc6b
https://github.com/teticio/audio-diffusion/assets/134219563/36167f80-2fa7-4e9e-a586-a6ef1c1108ea
The first two clips are good examples of sampling material; the last one comes from the variation generation function with `start_step` set to 650.
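For anyone curious what `start_step` does conceptually: in DDPM-style variation generation, instead of denoising from pure noise, you partially noise an existing spectrogram with the forward process and then denoise from that intermediate timestep, so the output keeps some structure of the original. The sketch below shows only the forward-noising half, using NumPy and the standard DDPM linear beta schedule (1e-4 to 0.02 over 1000 steps); those schedule values, the assumption that `start_step` indexes a forward timestep, and the array names are illustrative assumptions, not the repo's actual implementation.

```python
import numpy as np

T = 1000  # number of diffusion timesteps (standard DDPM default, assumed here)
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed)
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def add_noise(x0, t, rng):
    """Forward-diffuse x0 to timestep t:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((448, 448))  # stand-in for a normalized 448x448 spectrogram
x_650 = add_noise(x0, 650, rng)       # heavily noised, but traces of x0 remain
```

From `x_650`, the reverse (denoising) loop would run from timestep 650 back down to 0, which is why a higher `start_step` preserves more of the source material and a lower one diverges further from it.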
Please keep in mind that these are initial results; I plan to refine and improve them as I gain access to more computational resources (soon, I hope; I can't wait to work on higher spectrogram resolutions).
Hello teticio! (and the community :smile:),
I'm a sound and music enthusiast, and I recently discovered your project on GitHub. I wanted to express my immense gratitude for your work and for sharing it. It's been a joy to experiment with the diffusers.
Below are some specific details about my experience with your project:
Dataset: I used 500 tracks: a hundred of them were my own productions (spanning genres like minimalism, jazz, broken beat, and breakbeat), another hundred came from producer friends working in a similar style, and the remaining 300 were a curated collection of the best music used in DJ sets and podcasts. All files were in .wav format at a 48,000 Hz sample rate, and the spectrogram dimensions were 448×448.
Hyperparameters: The training was conducted with the following settings:
This amounted to almost 40 hours of training time. The results were fascinating: while some of the generated sounds were unusable, some grooves (especially in the low end) were impressively clear and precise.
This training configuration produced the best results I've seen across all my trials, even with relatively few epochs and limited computational power. The key to the best possible outcome seems to lie in ensuring perfect consistency of the dataset and prioritizing quality over quantity.
I'd love to share my audio results here (with your permission, of course). Currently, I'm running a new 20-epoch training run, primarily on jazz and softer old-school trip-hop. I've drastically reduced the dataset (down to 46 tracks) but tuned the hyperparameters accordingly. I'm hopeful for satisfying results.
To conclude, I believe that with substantial computational power, the creative possibilities could lead to extraordinary generations. I'm curious whether anyone else has ventured far with that kind of resources.
Once again, thank you! Your work has led me to explore a whole new world of creativity. I'm excited to see what you'll develop next.
(P.S.: I apologize for any issues with the formatting, I'm still getting used to GitHub, and I officially joined today)
Best, Fonk