marcoppasini / musika

Fast Infinite Waveform Music Generation
MIT License
662 stars 49 forks

Scaling Up #6

Closed · lalalune closed this 1 year ago

lalalune commented 2 years ago

First off -- fantastic achievement. The generation is really fast, even on HuggingFace. That said, I'd be willing to wait, and to run it on higher-end hardware, for better-quality generation, even if the returns were diminishing. I have a few questions, if you don't mind--

Any help appreciated. Will try to tackle it on my own. If we manage to train a big model, can post it on HuggingFace and such!

marcoppasini commented 2 years ago

Are there any parameters that can be tuned that won't cause instability in the model?

You can increase the --base_channels parameter to scale up the system; if you then notice unstable behavior, lowering the learning rate to 0.00007 or 0.00005 should solve it.
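For reference, a scaled-up run might look like the sketch below. Only `--base_channels` is confirmed in this thread; the script name, dataset path, channel value, and the learning-rate flag name are my assumptions, so check the repo's README for the actual names:

```shell
# Hypothetical invocation -- only --base_channels is confirmed above;
# the script name, dataset path, and --lr flag name are assumptions.
python musika_train.py \
    --train_path /path/to/encoded_dataset \
    --base_channels 192 \
    --lr 0.00007   # drop to 0.00005 if training is still unstable
```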

Obviously sample rate has an impact, as the 44.1 kHz model definitely sounds better than the 22 kHz one -- would a higher sample rate or bit rate potentially improve quality?

No, I don't think so. To improve quality further, the audio autoencoder itself would need to improve, by reconstructing the input samples more faithfully. Right now it is far from perfect, but still quite acceptable considering the extremely high time compression ratio (4096x).
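To put the 4096x time compression in perspective, here is a quick back-of-the-envelope calculation. The constants come from the numbers quoted in this thread; the helper function is just for illustration:

```python
import math

SAMPLE_RATE = 44_100   # samples per second (the 44.1 kHz model)
COMPRESSION = 4096     # time compression ratio quoted above

def latent_frames(seconds: float) -> int:
    """Number of latent vectors needed to represent `seconds` of audio."""
    return math.ceil(seconds * SAMPLE_RATE / COMPRESSION)

# Each latent vector stands in for 4096 raw samples of audio.
seconds_per_frame = COMPRESSION / SAMPLE_RATE
print(f"each latent frame covers ~{seconds_per_frame * 1000:.1f} ms")
print(f"a 60 s clip needs {latent_frames(60)} latent frames")
```

So every latent vector the autoencoder produces has to summarize roughly 93 ms of raw waveform, which is why reconstruction quality is the bottleneck rather than sample rate.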

Have you experimented at all with conditioning or guidance? Would it be difficult to adapt the model to text2audio or mixing tags in the future?

I experimented with tempo conditioning in particular, which works fairly well. It would be totally possible to condition musika on text or other tags, but it would require orders of magnitude more training data than what I used.
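For anyone curious what label conditioning looks like mechanically: one common approach (not necessarily how musika's tempo conditioning is implemented internally) is to embed the label and concatenate that embedding with the generator's noise input. A minimal NumPy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

NOISE_DIM = 128        # hypothetical latent noise size
N_TEMPO_BUCKETS = 16   # hypothetical: tempo quantized into 16 buckets
EMBED_DIM = 32         # hypothetical embedding width

# In a real model this table is learned; random here for illustration.
tempo_embedding = rng.normal(size=(N_TEMPO_BUCKETS, EMBED_DIM))

def conditioned_input(noise: np.ndarray, tempo_bucket: int) -> np.ndarray:
    """Concatenate a tempo embedding onto the generator's noise input."""
    return np.concatenate([noise, tempo_embedding[tempo_bucket]])

z = rng.normal(size=NOISE_DIM)
x = conditioned_input(z, tempo_bucket=7)
print(x.shape)  # (160,)
```

Text conditioning follows the same pattern, but the embedding comes from a text encoder instead of a lookup table, which is where the much larger data requirement comes in.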

Would training longer yield anything?

In my experiments I trained the system for much longer than what is needed (I recommend around 2 million iterations), but quality can actually degrade with longer training, at least for some music domains.

Hope this helps, and thank you for posting!