Randy-H0 opened this issue 3 years ago
We are probably 5-10 years away from end-to-end song generation; isn't it the equivalent of an AI writing a book? The good news is that it can augment existing technologies today. There were some recent efforts by @robgon-art to create a MIDI song with structure based on GPT + MIDI: https://github.com/robgon-art/ai-tunes
Any updates or tests on this with an A100 GPU?
This is for anyone who is interested in trying to make something like Jukebox.
One of the first things I noticed about Jukebox is how coherent the lyrics are and how good the generated songs can be. I also noticed that it only looks back about 24 seconds from the point where it is generating; it's like priming, but on the data it has just generated itself. This leaves the neural network with a problem: it can't lay out the structure of a song properly. It creates the first chorus, and by the time it reaches the second chorus it only knows about the previous 24 seconds, so it ends up producing different choruses within the same song.
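To make the limitation concrete, here is a minimal sketch of autoregressive sampling with a fixed context window. This is not Jukebox's actual sampling code; `DummyPrior`, the hop size, and the token rate are stand-ins I made up, though the 8192-token / ~24-second context roughly matches what the paper reports for the top-level prior.

```python
import random

# Assumed numbers for illustration: the top-level prior sees roughly 8192
# codes at ~345 codes/second (44100 Hz / hop 128), i.e. about 24 seconds.
CONTEXT_TOKENS = 8192
TOKENS_PER_SECOND = 345

class DummyPrior:
    """Stand-in for the real top-level prior; emits random VQ codes."""
    def sample(self, primer, n_tokens):
        return [random.randrange(2048) for _ in range(n_tokens)]

def sample_song(model, total_seconds, hop_seconds=6.0):
    """Windowed sampling: the model conditions only on the last
    CONTEXT_TOKENS codes, so earlier sections fall out of its memory."""
    total_tokens = int(total_seconds * TOKENS_PER_SECOND)
    hop_tokens = int(hop_seconds * TOKENS_PER_SECOND)
    codes = []
    while len(codes) < total_tokens:
        window = codes[-CONTEXT_TOKENS:]  # anything older is simply dropped
        codes.extend(model.sample(primer=window, n_tokens=hop_tokens))
    return codes[:total_tokens]

song = sample_song(DummyPrior(), total_seconds=180)  # a 3-minute song
# By the second chorus (~90 s in), the first chorus (~30 s in) lies far
# outside the 24-second window, so the model cannot quote it back.
```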
A possible solution is to use the pre-made labels in the lyrics: [Verse], [Chorus], [Pre-Chorus], etc. You configure the sampler so that the neural network keeps, say, 4-6 seconds of the verse in its memory and maybe 10-12 seconds of the chorus, use something that predicts the structure of the song, and then use the audio it has already generated as a reference (see the sketch below).
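Here is one hypothetical way to wire that up: reserve part of the priming window for a reference copy of the section about to be generated, and fill the rest with the most recent audio. Everything here (`build_context`, `section_codes`, the second budgets) is invented for illustration and is not part of Jukebox.

```python
TOKENS_PER_SECOND = 345  # same rough conversion as above

# Per-section memory budgets from the idea above: 4-6 s of verse,
# 10-12 s of chorus (midpoints used here).
SECTION_BUDGET_SECONDS = {"[Verse]": 5, "[Chorus]": 11}

def build_context(section_codes, upcoming_section, recent_codes,
                  max_tokens=8192):
    """Assemble the priming window: a reference snippet of the section the
    model is about to generate, plus as much recent audio as still fits."""
    context = []
    budget = SECTION_BUDGET_SECONDS.get(upcoming_section)
    if budget is not None and upcoming_section in section_codes:
        n = budget * TOKENS_PER_SECOND
        context.extend(section_codes[upcoming_section][:n])  # reference copy
    remaining = max_tokens - len(context)
    if remaining > 0:
        context.extend(recent_codes[-remaining:])  # most recent audio
    return context

# Usage: when the lyric labels say the next section is a [Chorus], prime
# with the codes of the first chorus so the two choruses match.
first_chorus = [0] * (20 * TOKENS_PER_SECOND)  # placeholder codes
recent = [1] * (30 * TOKENS_PER_SECOND)        # placeholder codes
window = build_context({"[Chorus]": first_chorus}, "[Chorus]", recent)
```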
Another issue is speed: on Google Colab it currently takes about 1 hour to generate 1 minute of audio, which is slow and rules out the interactive applications mentioned in the blog post.
A solution could be to optimize the neural network or use a very fast GPU like an A100. I've heard that this GPU can fully upsample 3 one-minute samples in about 1 hour. Synthesizing the bottom level would probably take even less, so total generation time could drop to 1-2 hours. But the Tesla A100 is still expensive and not available to a lot of people.
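Back-of-envelope math with the two figures quoted above (both anecdotal, not benchmarks; the A100 figure assumes you batch 3 samples at once):

```python
song_minutes = 3  # a typical full-length song

colab_hours_per_min = 1.0     # "about 1 hour per minute" on Colab
a100_hours_per_min = 1.0 / 3  # 3 one-minute samples upsampled in ~1 hour

print(f"Colab: ~{song_minutes * colab_hours_per_min:.0f} h per song")
print(f"A100:  ~{song_minutes * a100_hours_per_min:.0f} h per song")
# Colab: ~3 h per song
# A100:  ~1 h per song
```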
Another thing is how coherent this neural network is. Yes, it's very flexible and can sing words it has never heard because of how it was trained, but the different models were trained differently. For example, the Ed Sheeran model performs very well, while I didn't get results that good from the Shawn Mendes or Ariana Grande models. The results were still pretty good, just not as good. This isn't because the neural network is an Ed Sheeran fan; those models were simply trained on less data, or on different or more complex material, which is the reality of music.
To fix this, you can try training or fine-tuning your own model on a set of songs and keep training longer until you are satisfied.
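If you want to try that, the Jukebox README has a section on fine-tuning the pre-trained 1B-lyrics top-level prior on your own audio. The command below is adapted from it; double-check the README for the current flags, and substitute your GPU count and audio directory for the `{...}` placeholders:

```
mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,prior_1b_lyrics,all_fp16,cpu_ema \
  --name=finetuned --sample_length=1048576 --bs=1 --aug_shift --aug_blend \
  --audio_files_dir={audio_files_dir} --labels=True --train --test --prior \
  --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000
```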
Please note that I'm not a machine learning expert; I merely play around with neural networks and try to have fun. Some of this information may well be wrong, and if it is, please correct me!