maciejsaw opened this issue 5 years ago
Your idea is possible, and has already been done in the paper I was trying to replicate here: https://arxiv.org/abs/1810.12247
According to the paper, they are synthesizing polyphonic piano music. It's still probably a few years away from being available in software, however. A big issue is speed: without a fat GPU you can't render the audio in real time, and if you are trying to play multiple MIDI rolls then this problem compounds. This could be bypassed by rendering the audio from MIDI once, but you'd have to redo it every time you change the MIDI roll. Some mix of classic MIDI rendering for prototyping and AI for the final rendering would probably work out best.
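To illustrate that mixed workflow, here is a minimal Python sketch of a render cache keyed by a hash of the MIDI file, so the expensive neural rendering only runs when the roll actually changes. The `render_fast_sampler` and `render_neural` functions are hypothetical placeholders, not part of this project or any real library:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("render_cache")
CACHE_DIR.mkdir(exist_ok=True)

def midi_fingerprint(midi_path):
    """Hash the MIDI file so we only re-render when the roll actually changes."""
    return hashlib.sha256(Path(midi_path).read_bytes()).hexdigest()

def render(midi_path, render_fast_sampler, render_neural, final=False):
    """Use a cheap classic sampler while editing; run the slow neural model
    once per unique version of the MIDI roll for the final render."""
    if not final:
        return render_fast_sampler(midi_path)

    cached = CACHE_DIR / (midi_fingerprint(midi_path) + ".wav")
    if cached.exists():
        return cached.read_bytes()

    audio_bytes = render_neural(midi_path)  # expensive GPU rendering, done once
    cached.write_bytes(audio_bytes)
    return audio_bytes
```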
Also, there is a lack of high-quality MIDI+audio data for most instruments. The piano dataset used to train this network contains 100+ hours of recordings and was assembled over the course of several years. The MIDI data is extremely finely aligned because it was captured directly from the performance piano, a professional-quality electronic piano. It's the best audio+MIDI dataset that exists, unless it's been one-upped since I stopped working on this project. Datasets with finely aligned MIDI will be much more expensive to create for non-piano instruments.
Training this network with monophonic music wouldn't fix the fundamental problem I was having in training it successfully. Basically, there are two neural networks: one that encodes the MIDI, and one that generates audio samples using both the encoded MIDI and the previous audio samples as reference. The audio generation network is so powerful that it learns to completely ignore the MIDI encoding network. Making the MIDI signal simpler wouldn't change this, and counter-intuitively it may make things worse. Similarity to the human voice doesn't matter in this context, since it's all about forcing the audio network to pay attention to the MIDI network.
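For anyone curious what that setup roughly looks like, here is a toy PyTorch sketch (my own simplification, not the actual code from the paper): the MIDI encoder produces a conditioning signal that enters the audio model only as an additive bias, so a sufficiently powerful autoregressive audio model can fit the data while sending almost no useful gradient back into the encoder, which is the "ignoring the MIDI" failure mode described above:

```python
import torch
import torch.nn as nn

class MidiEncoder(nn.Module):
    """Encodes a piano roll (batch, 128, time) into conditioning features."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Conv1d(128, hidden, kernel_size=3, padding=1)

    def forward(self, roll):
        return self.net(roll)  # (batch, hidden, time)

class AudioDecoder(nn.Module):
    """Autoregressive audio model conditioned on the MIDI encoding.

    If this network is expressive enough, it can predict the next sample
    from past audio alone, so the gradient flowing into `cond` becomes
    negligible and the MIDI encoder is effectively ignored.
    """
    def __init__(self, hidden=64):
        super().__init__()
        self.audio_conv = nn.Conv1d(1, hidden, kernel_size=2)  # looks only at past samples
        self.out = nn.Conv1d(hidden, 256, kernel_size=1)       # 8-bit mu-law logits

    def forward(self, past_audio, cond):
        h = self.audio_conv(past_audio)            # (batch, hidden, time-1)
        h = h + cond[..., : h.shape[-1]]           # conditioning enters only as a bias
        return self.out(torch.relu(h))
```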
Let me know if there's anything else you want to talk about regarding this type of project. I agree that having AI play MIDI rolls for us would be an incredible improvement in audio synthesis; it would basically give anyone access to a professional orchestra.
I have several ideas for how we can overcome these problems (I don't want to give up easily!).
Insight:
Assumptions:
Rough algorithm for PoC:
1. Preparing a dataset
2. Preprocessing the input audio
3. Mapping the chunks to the dataset
4. Rendering the final audio
You could call it "rebuilding" or "enhancing" the audio generated by a MIDI sampler with pieces and chunks of real audio recorded by a real instrumentalist, with real note transitions, real articulations, etc. (see the sketch below).
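Here is a rough Python sketch of what the "mapping the chunks to the dataset" step could look like, using nearest-neighbour matching in MFCC space via librosa. The chunk size, file names, and feature choice are all just assumptions for illustration; a real version would also need overlap/crossfading and awareness of note boundaries:

```python
import numpy as np
import librosa

SR = 22050
CHUNK = SR // 4  # 250 ms chunks; purely an assumption for this sketch

def chunk_features(audio):
    """Split audio into fixed-size chunks and summarize each with mean MFCCs."""
    chunks = [audio[i : i + CHUNK] for i in range(0, len(audio) - CHUNK + 1, CHUNK)]
    feats = [librosa.feature.mfcc(y=c, sr=SR, n_mfcc=13).mean(axis=1) for c in chunks]
    return chunks, np.stack(feats)

def rebuild(sampler_audio, real_audio):
    """Replace each chunk of the sampler rendering with the closest-sounding
    chunk from a real recording (nearest neighbour in MFCC space)."""
    _, synth_feats = chunk_features(sampler_audio)
    real_chunks, real_feats = chunk_features(real_audio)
    out = []
    for f in synth_feats:
        idx = np.argmin(np.linalg.norm(real_feats - f, axis=1))
        out.append(real_chunks[idx])
    return np.concatenate(out)

# Usage (file names are placeholders): load both renderings at the same
# sample rate, then rebuild the sampler audio from real-instrument chunks.
# sampler_audio, _ = librosa.load("sampler_render.wav", sr=SR)
# real_audio, _ = librosa.load("real_performance.wav", sr=SR)
# enhanced = rebuild(sampler_audio, real_audio)
```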
I found your project and it seems related to an idea I posted on a different repo here: https://github.com/facebookresearch/music-translation/issues/5. I would appreciate it if you could take a look at the idea and see whether it is possible to implement. It seems that in this project you tried to achieve results similar to what I described there, but maybe you didn't think of converting monophonic MIDI to monophonic instruments like sax/clarinet/trumpet/violin. A trumpet is much more similar to the human voice than a piano is, so maybe you could revisit your project with this different assumption and it might be successful.