Song composition is difficult and requires talent. Nowadays, there is more and more research on using learning algorithms, especially deep learning, for jobs that require creativity, including composing music. In our project, we want to implement an intelligent system that can transfer songs from one genre to another. This work is inspired by the fact that many indie musicians cover popular songs in different musical styles, such as covering rock 'n' roll songs in an acoustic style or folk songs in a Jazz style.
In this project, we choose Jazz as our target genre because Jazz has several significant and unique attributes, such as spontaneous tempo, frequent seventh chords, and improvisation of notes. Even people who don't know much about Jazz can easily recognize Jazz songs.
Our ultimate goal is to train a machine learning model that captures characteristics of Jazz songs and is able to transform non-Jazz songs into Jazz ones.
When an artist covers a song, the melody is usually retained and the background music is changed. Therefore, we mimic this process by separating a song into the main melody (content) and the background music/harmony (style), and we focus on changing the background music. The background music has to follow a sequence of chords to be harmonic with the main melody. To create a reasonable scope for our project, we decided to focus on the piano track of the background music.
Depending on the style, different ways to play chords (i.e., note combinations) produce different styles of music. Jazz performers in particular often improvise chords by playing different notes, and playing a chord in different combinations creates different musical feelings. For example, the E3 chord has at least three different expressions in the song "Imagine". Our approach is to learn a model that takes a list of chords as input and produces jazz-style note combinations.
| Chord | Note Combination |
| --- | --- |
| E3 | {G in octave 3, E in octave 3} (E3-interval) |
| E3 | {E in octave 3, D in octave 4, G in octave 3} (E3-minor-seventh) |
| E3 | {G in octave 3, E in octave 3, B in octave 3} (E3-minor triad) |
| E3-minor-seventh | {E in octave 3, D in octave 4, G in octave 3, B in octave 3} |
| E3-minor-seventh | {D in octave 4, G in octave 4, E in octave 3} |
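The one-to-many mapping in the table above can be sketched in code. This is a hypothetical illustration (the note names and groupings are examples in the spirit of the table, not extracted from our data):

```python
# Hypothetical illustration of the one-to-many chord mapping: the same chord
# token can be realized by several different note combinations.
chord_realizations = {
    "E3": [
        ("E3", "G3"),              # interval
        ("E3", "G3", "D4"),        # minor seventh, no fifth
        ("E3", "G3", "B3"),        # minor triad
    ],
    "E3-minor-seventh": [
        ("E3", "G3", "B3", "D4"),
        ("E3", "D4", "G4"),
    ],
}

# The model's job is to pick one realization per chord in context,
# rather than always emitting the most frequent one.
for chord, options in chord_realizations.items():
    print(chord, "->", options[0])
```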
After breaking down our problem using the above knowledge about music, we treat it as a translation problem (for example, translating English to Vietnamese). We decided to use a state-of-the-art technique in this domain, the Neural Machine Translation (NMT) seq2seq model (Sutskever et al., 2014; Cho et al., 2014). This technique has already achieved great success in machine translation, speech recognition, and text summarization. A chord sequence is treated as the input language, and NMT translates it into a sequence of notes (the output language) to be played.
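To make the "chords as source language, notes as target language" framing concrete, a training pair might look as follows. The exact token format is an assumption for illustration, not taken verbatim from our preprocessing:

```python
# Source sentence: a sequence of chord tokens.
# Target sentence: one note-combination token per chord, where the notes of a
# combination are joined with '|' to form a single target-vocabulary word.
source_sentence = ["E3", "A3-minor-seventh", "D3"]
target_sentence = ["E3|G3|B3", "A3|C4|E4|G4", "D3|F#3|A3"]

def detokenize(token):
    """Split a note-combination token back into individual notes."""
    return token.split("|")

assert detokenize(target_sentence[0]) == ["E3", "G3", "B3"]
```

Encoding each note combination as a single target word is why the note vocabulary (see the vocabulary table below) is much larger than the chord vocabulary.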
In this study, we use two representations of a chord: the complete representation (e.g., E3-incomplete minor-seventh) and the compact representation (e.g., E3). The objective is to find the note combination to play for a chord, e.g., E3 -> {E in octave 3, D in octave 4, G in octave 3}. After the note combinations for the chords are found, we incorporate them into the original song to produce a jazz-style song.
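The compact representation can be derived from the complete one by keeping only the root-plus-octave prefix. The '-'-separated naming scheme below is an assumption about the token format:

```python
def compact_name(complete_name):
    """Reduce a complete chord name (e.g. 'E3-minor-seventh') to its compact
    form (e.g. 'E3') by keeping only the root-plus-octave prefix.
    The '-'-separated naming scheme is an assumption, for illustration."""
    return complete_name.split("-", 1)[0]

print(compact_name("E3-minor-seventh"))  # -> E3
print(compact_name("E3"))                # -> E3
```

Because this mapping is many-to-one, the compact vocabulary is far smaller than the complete one, which trades vocabulary size against information about chord quality.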
Methodology: We evaluate the performance of our algorithm on jazz MIDI files. The experiments are conducted on a Linux machine with a 3.4 GHz CPU, 8 GB of memory, and a GPU. To measure the goodness of the model, we report the BLEU score and perplexity on the test set. The default hyperparameters can be found here.
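For reference, the perplexity we report is the standard exponential of the average per-token negative log-likelihood. A minimal sketch (the loss values below are illustrative only):

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity = exp of the average per-token negative log-likelihood
    (natural log assumed)."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Toy per-token losses, illustrative values only:
print(round(perplexity([2.0, 3.0, 4.0]), 2))  # -> 20.09
```

Lower perplexity means the model is less "surprised" by the reference note combinations.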
Datasets: We crawled ~1500 jazz MIDI files. From that dataset, we use the 802 files that have piano tracks for evaluation. We extract chords and their corresponding notes from the MIDI files and split the chord sequences into sentences of length L (default value L = 20 chords). After that, we divide them into train, dev, and test datasets. The train and dev datasets are used for model training.
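The sentence-splitting step can be sketched as a simple chunking of each song's chord sequence (chord token names below are placeholders):

```python
def to_sentences(chords, L=20):
    """Split a song's chord sequence into sentences of at most L chords."""
    return [chords[i:i + L] for i in range(0, len(chords), L)]

# Toy example with L = 5 (the project default is L = 20):
song = [f"chord{i}" for i in range(12)]
sentences = to_sentences(song, L=5)
print([len(s) for s in sentences])  # -> [5, 5, 2]
```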
With default settings:
| | Train | Dev | Test |
| --- | --- | --- | --- |
| Number of jazz files | 562 | 80 | 160 |
| Number of sentences | 11049 | 1482 | 2897 |
Vocabulary:

| | #Chords | #Notes |
| --- | --- | --- |
| Chord compact representation | 79 | 39837 |
| Chord complete representation | 2269 | 39387 |
In this section, we show the results with the two types of chord representation: the complete name and the compact name. The compact name is a condensed form of the complete name; multiple chords with different complete names can share the same compact name.
Compact chord representation:

| | Dev | Test |
| --- | --- | --- |
| BLEU | 10.5 | 8.4 |
| Perplexity | 51871.42 | 71964.79 |
Result sample:
| | Original Song | Transformed Song |
| --- | --- | --- |
| Name | Can you feel my love tonight | Can you feel my love tonight - jazz version |
| Music sheet (beginning part) | | |
| Full Midi file | Can you feel my love tonight | Can you feel my love tonight jazz |
Complete chord representation:

| | Dev | Test |
| --- | --- | --- |
| BLEU | 15.0 | 10.7 |
| Perplexity | 21539.96 | 45704.38 |
Result sample:
| | Original Song | Transformed Song |
| --- | --- | --- |
| Name | Can you feel my love tonight | Can you feel my love tonight - jazz version 2 |
| Music sheet (beginning part) | | |
| Full Midi file | Can you feel my love tonight | Can you feel my love tonight jazz 2 |
Because the complete chord representation achieves better results, we use it for the following experiments.
We split each song (in the train, dev, and test datasets) into multiple sentences because each chord can depend only on some previous chords. We vary the sentence length from 5 to 20 chords. BLEU score:
| | L = 5 | L = 10 | L = 20 |
| --- | --- | --- | --- |
| Test BLEU score | 10.8 | 10.5 | 10.7 |
As the table above shows, we get similar results with different sentence lengths. Below we present the detailed test results with L = 5 and L = 10; the result with L = 20 was already presented in the previous section:
| | Original Song | L = 5 | L = 10 |
| --- | --- | --- | --- |
| Name | Can you feel my love tonight | Can you feel my love tonight - jazz5 | Can you feel my love tonight - jazz10 |
| Music sheet (beginning part) | | | |
| Full Midi file | Can you feel my love tonight | Can you feel my love tonight - 5 | Can you feel my love tonight - 10 |
The full music sheets of these and other transformed songs are available here.
Result sample:
Original Song | Transformed song |
---|---|
Can you feel the love tonight midi | Can you feel the love tonight jazz, midi |
Imagine midi | Imagine jazz, midi |
All of me midi | All of me jazz, midi |
Nang am xa dan midi | Nang am xa dan jazz, midi |
Ngoi nha hanh phuc - Full house midi | Ngoi nha hanh phuc jazz, midi |
Quay ve di midi | Quay ve di jazz, midi |
Set fire to the rain midi | Set fire to the rain jazz, midi |
Kiss the rain midi | Kiss the rain jazz midi |
Combination of the transformed piano with bass and harmonica tracks:
Imagine: midi
Because another important element of Jazz is spontaneous note durations, we would also like to mimic what Jazz performers do with rhythm. Jazz performers play notes and chords spontaneously; for example, four plain quarter notes may be played with dotted, swing-like rhythms to express different emotions. If we can predict how a Jazz performer would vary the durations of the chords, the song will sound even closer to Jazz.
Our first idea was to reuse the previous pipeline to translate a chord sequence into a duration sequence. That is, we would like to answer the following question: "At this moment, the performer wants to play a C chord; how long will he/she play the note combination?"
We extracted the chord-duration mapping to train our model. Unfortunately, even after tuning different hyperparameters, the results always converged to one or two durations. A possible reason is that the chord vocabulary contains more than 2000 words while there are only 120 distinct durations. Furthermore, most chords are played as 16th notes (0.25), so always predicting a 16th note tends to get a higher score. After 12000 iterations (8 hours) of training, input sentences were almost always translated into all-16th-note outputs, which is not what we expected.
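The collapse we observed is the classic majority-class problem, which a simple frequency count makes visible. The counts below are made up to mirror the skew we saw, not our actual data:

```python
from collections import Counter

# Illustrative (not real) duration labels: when one class dominates the
# training data, always predicting it already scores well.
durations = [0.25] * 90 + [0.5] * 7 + [1.0] * 3

counts = Counter(durations)
majority, freq = counts.most_common(1)[0]
print(majority, freq / len(durations))  # dominant class and its share
```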
At this point, translating chords to durations with NMT is not a successful approach. We conclude that chords are not directly related to note durations in the dataset; from a musical viewpoint, performers improvise the durations of individual notes, not of chords, and keep the chords constant. Another approach to this problem is to generate the note durations of a whole music bar (similar to generating drum patterns). However, due to time constraints, we leave this as future work.
Most other music transfer works belong to two categories. In the first category, researchers learn a musical style from samples in order to generate new, "random" songs of that style [1, 2, 3]; the authors also use musical knowledge to improve the quality of the generated pieces. In the second category, researchers transfer the style of one song's audio signal (in WAV format) onto another song [4, 5, 6] using techniques like WaveNet. However, this approach requires a large amount of data as well as resources for training, and simpler transfer models produce very poor-quality output [5].
In our approach, we use basic musical knowledge to turn the music transfer problem into a sequence translation problem. We don't rely on any detailed musical knowledge about Jazz; instead, we train a model that is able to create a Jazz feeling. We believe our approach is the first of its kind.
In this project, we have successfully transformed songs into jazz versions using chord translation: the chords of the original song are played with jazz-style note combinations. We also conducted experiments on generating jazz-style durations, but that direction requires more exploration. Here we note some directions that can be explored in the future:
Transpose the songs in the dataset to the same key (e.g., C major) to reduce the vocabulary size, which in turn effectively increases the amount of training data.
Apply our approach to other types of instruments.
Fine-tune the hyperparameters and try different algorithms/models to enhance the results.
Divide each song into pieces of 2 or 4 bars, because the music within such a span (like a whole musical sentence) is internally coherent; this might make it easier for the model to learn how the chords should be played.
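The key-normalization idea above amounts to transposing every chord root by a fixed number of semitones. A minimal sketch, assuming simplified chord names without octaves or accident spelling beyond sharps:

```python
# Sketch of transposing chord root names by a number of semitones, as the
# "normalize every song to C major" idea would require. Chord-name parsing
# here is simplified and hypothetical.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose_root(root, semitones):
    """Shift a root note name (without octave) by the given semitones."""
    return NOTES[(NOTES.index(root) + semitones) % 12]

print(transpose_root("E", -4))  # -> C
print(transpose_root("A", 3))   # -> C
```

A full implementation would also need to detect each song's key and adjust the octave digits in the chord tokens.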