Convert monophonic MIDI instrument to monophonic real instrument?

maciejsaw commented 5 years ago

I found your project and it seems related to my idea that I posted on a different repo here https://github.com/facebookresearch/music-translation/issues/5 I will appreciate if you take a look at the idea and see if it is possible to implement. It seems that in this project you tried to achieve similar results as I described there, but maybe you didn't think of converting monophonic midi to monophonic instruments like sax/clairnet/trumpet/violin. Trumpet is much more similar to human voice than piano, maybe you could revisit your project with this different assumption and it might be successful.

plunkgj commented 5 years ago

Your idea is possible, and already done by the paper I was trying to replicate here: https://arxiv.org/abs/1810.12247

According to the paper, they are replicating polyphonic piano music. Its still probably a few years away from being available in software however. A big issue is speed - without a fat GPU you can't render the audio in realtime, and if you are trying to play multiple midi rolls then this problem compounds. This would be bypassed by rendering audio from midi once, but you'd have to do this every time you change the midi roll. Some mix of classic midi rendering for prototyping, but AI for the final rendering would probably work out best.

Also, there is a lack of high quality midi+audio data existing for most instruments. The piano dataset used to train this network has 100+ hours of time, and was assembled over the course of several years. The midi data was extremely fine-aligned because it was being picked up from the performance piano directly, which was a professional-quality electronic piano. Its the best audio+midi dataset that exists unless its been one-upped since I stopped working on this project. Datasets with finely aligned midi will be much more expensive to create for non-piano instruments.

Training this network with monophonic music wouldn't fix the fundamental problem I was having in successfully training this network. Basically theres two neural networks, one which encodes midi, and one which generates audio samples using both the encoded midi and the previous audio samples as reference. The audio generation network is so powerful that it learns to completely ignore the midi encoding network. Making the midi signal simpler wouldn't change this, and counter-intuitively may make it worse. Similarity to human voice doesn't matter in this context since its all about forcing the audio network to pay attention to the midi network.

Let me know if theres anything else you want to talk about regarding this type of project, I agree that the possibility of having AI playing midi rolls for us would be an incredible improvement in audio synthesis. Basically gives anyone access to a professional orchestra.

maciejsaw commented 5 years ago

I have several ideas how we can overcome the problems (don't want to give up easily!).

Insight:

A skillful music producer is able to create a new melody by cutting and joining parts of trumpet recorded by a real player
The more pre-recorded audio phrases of a trumpet you have, the easier is to mangle the audio and achieve good realistic result (a real proof of this is this plugin https://www.kvraudio.com/product/liquid-trumpet-by-ueberschall)

Assumptions:

we should not convert midi to audio but audio to audio - therefore we don't need a dataset of combined audio&midi
we should not try to synthesize the sound but simply let the neural network find similar phrases and join them in a way so that it matches the original audio
we don't need real-time - a composer doesn't need a super high quality rendering while working, similarly as 3D artists don't see the rendered result until they render it

Rough algorithm for PoC:

Preparing a dataset

We need to prepare a recording of a real instrument - I suggest we start with one octave and C-major scale only and I can record it on children flute, the recording should have most of the notes transitions and several articulations
We convert the recording to a dataset of "phrases" of various length, split with different methods

Preprocessing the input audio:

We also use the recorded audio to create a sampler instrument that resembles the real instrument as closely as possible, simply map C-major notes to a scale in Logic EXS24 - I'm happy to do this for you
We record a "prototype" audio with MIDI sampler instrument and bounce it to audio
Split the audio it into "clips", for example every 2 seconds (or we might split them by each even/odd audio transient) - we should make sure we snap to zero-crossing to avoid clicks

Mapping the chunks to dataset

We use a tool like https://ml4a.github.io/guides/AudioTSNEViewer/ to match each "prototype" clip with the most similar "real" clip
We start building up the new result audio file by trying to match the longest possible phrase (with big enough dataset of hundreds of hours of solo music there's a big chance that a specific passage of notes was already recorded)
the algorithm should only match the clip if specific treshhold of similarity is achieved (especially matching right pitch is key)
We iterate over the input file and gradually lower down the length of a clip that we try to match to fill in the gaps,
We continue to shorten the clip length that we try to match and increase the threshold of similarity needed
As a result we get a list of clips and their timecode where they should start

Render final audio

We now need to combine all these "selected" clips with a so called "equal power" crossfade to avoid clicks

You can call it "rebuilding" or "enhancing" a MIDI Sampler-generated audio with pieces and chunks of a real audio recorded by a real instrumentalist with real note transitions, real articulations etc.

plunkgj / midi2wave

Convert monophonic MIDI instrument to monophonic real instrument? #1