1. Introduction
Composing songs is difficult and requires talent. There is a growing body of research that uses learning algorithms, especially deep learning, for tasks that require creativity, including music composition. In this project, we implement an intelligent system that transfers songs from one genre to another. The work is inspired by the many indie musicians who cover popular songs in different musical styles, for example rock 'n' roll songs in an acoustic style or folk songs in a Jazz style.
In this project, we choose Jazz as our target genre because it has several distinctive attributes, such as spontaneous tempo, frequent seventh chords, and improvised notes. Even people who do not know the details of Jazz can easily recognize Jazz songs.
Problem Formulation
Our ultimate goal is to train a machine learning model that captures some characteristics of Jazz songs and can transform non-Jazz songs into Jazz ones.
2. Proposed Approach
When an artist covers a song, the melody is usually retained and the background music is changed. We mimic this process by separating a song into the main melody (content) and the background music/harmony (style), and we focus on changing the background music. The background music has to follow a sequence of chords to stay in harmony with the main melody. To keep the scope of the project reasonable, we focus on the piano track of the background music.
Depending on the style, different ways of playing chords (i.e., different note combinations) produce different styles of music. In Jazz especially, performers like to improvise on chords by playing different notes, and playing a chord with different note combinations creates different musical feelings. For example, the E3 chord has at least 3 different expressions in the song “Imagine”. Our approach is to learn a model that takes a list of chords as input and produces Jazz-style note combinations.
| Chord | Note Combination |
|---|---|
| E3 | {G in octave 3, E in octave 3} (E3-interval) |
| E3 | {E in octave 3, D in octave 4, G in octave 3} (E3-minor-seventh) |
| E3 | {G in octave 3, E in octave 3, B in octave 3} (E3-minor triad) |
| E3-minor-seventh | {E in octave 3, D in octave 4, G in octave 3, B in octave 3} |
| E3-minor-seventh | {D in octave 4, G in octave 4, E in octave 3} |
After breaking down the problem using the musical knowledge above, we treat it as a translation problem (analogous to, for example, translating English to Vietnamese). We use a state-of-the-art technique in this domain, the Neural Machine Translation (NMT) seq2seq model (Sutskever et al., 2014; Cho et al., 2014). This technique has been very successful in machine translation, speech recognition, and text summarization. A chord sequence is treated as the input language, and NMT translates it into a sequence of notes (the output language) that should be played.
In this study, we use 2 representations of a chord: the complete representation (e.g., E3-incomplete minor-seventh) and the compact representation (e.g., E3). The objective is to find the note combination to play for each chord, e.g., E3 -> {E in octave 3 | D in octave 4 | G in octave 3}. Once the note combinations for the chords are found, we incorporate them into the original song to produce a Jazz-style song.
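To make the two chord representations concrete, the following is a minimal sketch rather than the project's actual preprocessing code. It assumes that the compact name is the part of the complete name before the first hyphen, and that each note combination is written as a single target token with notes joined by "|". The shorthand "G3" for "G in octave 3" and the helper names `compact_name` and `note_token` are illustrative assumptions.

```python
from collections import defaultdict


def compact_name(complete_name):
    """Return the compact chord name.

    Assumption: the compact name is everything before the first hyphen,
    e.g. "E3-incomplete minor-seventh" -> "E3".
    """
    return complete_name.split("-", 1)[0]


def note_token(notes):
    """Join the notes of one combination into a single target-side token."""
    return "|".join(notes)


# Hypothetical (complete chord name, note combination) pairs, using the
# shorthand "G3" for "G in octave 3".
pairs = [
    ("E3-interval", ["G3", "E3"]),
    ("E3-minor-seventh", ["E3", "D4", "G3"]),
    ("E3-minor triad", ["G3", "E3", "B3"]),
]

# Several complete names collapse onto one compact name, so the compact
# source vocabulary is smaller but each entry is more ambiguous.
by_compact = defaultdict(list)
for complete, notes in pairs:
    by_compact[compact_name(complete)].append(note_token(notes))
```

Under these assumptions, all three expressions end up under the single compact chord "E3", which illustrates why the compact vocabulary is much smaller than the complete one.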
3. Experiments
Methodology: We evaluate our algorithm on Jazz MIDI files. The experiments are conducted on a Linux machine with a 3.4 GHz CPU, 8 GB of memory, and a GPU. To measure the quality of the model, we report the BLEU score and perplexity on the test set. The default hyperparameters can be found here.
Datasets: We crawled ~1500 Jazz MIDI files. From that dataset, we use the 802 files that have piano tracks for evaluation. We export chords and their corresponding notes from the MIDI files and split the chords into sentences of length L (default L = 20 chords). We then divide them into train, dev, and test datasets. The train and dev datasets are used for model training.
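As a reminder of what the perplexity metric measures, the sketch below computes it from the probabilities a model assigns to the reference tokens. This is the standard definition (the exponential of the mean negative log-likelihood), not code taken from the project; the function name `perplexity` and the example probabilities are illustrative.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence given the model probability of each reference token.

    Computed as exp of the mean negative log-likelihood; lower is better.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that always spreads its probability evenly over 4 candidate
# note combinations behaves like a uniform 4-way guess.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
```

A perplexity close to the vocabulary size would mean the model is barely better than guessing, while a value close to 1 would mean it is nearly certain of each note combination.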
With default settings:
| | Train | Dev | Test |
|---|---|---|---|
| Number of jazz files | 562 | 80 | 160 |
| Number of sentences | 11049 | 1482 | 2897 |
| Vocabulary | #Chords | #Notes |
|---|---|---|
| Chord compact representation | 79 | 39837 |
| Chord complete representation | 2269 | 39387 |
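The sentence construction and the train/dev/test split described in the Datasets paragraph can be sketched as follows. This is an illustrative sketch under assumptions, not the project's code: it assumes that the split is made per file (consistent with the per-split file counts reported above), that a shorter final sentence is kept, and that files are shuffled with a fixed seed. The function names are hypothetical.

```python
import random


def split_into_sentences(chords, notes, length=20):
    """Cut aligned chord/note sequences into consecutive sentences.

    Assumption: the last sentence may be shorter than `length` and is kept.
    """
    if len(chords) != len(notes):
        raise ValueError("chords and notes must be aligned one-to-one")
    return [
        (chords[i:i + length], notes[i:i + length])
        for i in range(0, len(chords), length)
    ]


def split_files(files, n_dev, n_test, seed=0):
    """Shuffle files and split them into train, dev, and test lists."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Hypothetical song with 45 chords, each paired with one note combination.
chords = [f"chord{i}" for i in range(45)]
notes = [f"notes{i}" for i in range(45)]
sentences = split_into_sentences(chords, notes, length=20)

# Splitting 802 file indices with the dev/test sizes reported above.
train, dev, test = split_files(range(802), n_dev=80, n_test=160)
```

Splitting by file rather than by sentence keeps all sentences of one song in the same split, so that test sentences do not come from songs seen during training.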
3.1. Varying the chord representation.
In this section, we show the results for the 2 types of chord representation: the complete name and the simple name. The simple name is the more compact representation; multiple chords with different complete names can share the same simple name.
Compact chord representation: The BLEU score and perplexity are as follows.
Complete chord representation: The BLEU score and perplexity are as follows. The results show that the complete representation performs better than the compact representation because it provides more detail about the chords.
Because the complete chord representation achieves better results, we use it for all following experiments.
3.2 Varying the sentence length
We split each song (in the train, dev, and test datasets) into multiple sentences because each chord can depend on only a few preceding chords. We vary the sentence length from 5 to 20 chords.
BLEU score:
| | L = 5 | L = 10 | L = 20 |
|---|---|---|---|
| Test BLEU score | 10.8 | 10.5 | 10.7 |
As the table above shows, the results are similar across the different sentence lengths.
Below we present the detailed test results with L = 5 and L = 10. The result with L = 20 was already presented in the previous section.
3.3. Adjusting Notes' Duration
Another important element of Jazz is its spontaneous durations, and we would like to mimic what Jazz performers do. Jazz performers play notes and chords spontaneously; for example, 4 plain quarter notes will probably be played with dotted rhythms to express a different emotion. If we could predict how a Jazz song plays each chord with different durations, the result would sound closer to Jazz.
Our first idea is to reuse the previous approach to translate a chord sequence into a duration sequence. That is, we would like to answer the following question: "At this moment, the performer wants to play a C chord; how long will he/she play the note combination?".
We extracted the chord-to-duration mapping to train our model. Unfortunately, even after tuning different hyperparameters, the initial results still converged to one or two durations. A possible reason is that the chord vocabulary contains more than 2000 words, while there are only 120 different durations. Furthermore, most chords are played as 16th notes (0.25), so predicting a 16th note may always receive a higher score. After 12000 iterations (8 hours) of training, the input sentences tend to be translated entirely into 16th notes, which is not what we expected.
At this point, translating chords to durations with NMT is not a successful approach. We conclude that, in this dataset, the chords are not directly related to the note durations. Another approach to this problem is to generate the note durations of a whole bar of music (similar to generating drum patterns). However, due to time constraints, we leave this as future work.
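One way to check the imbalance described above is to measure how much of the target side is taken up by the most frequent duration. The sketch below is illustrative only: the function name `majority_share` and the example durations are hypothetical, and durations are assumed to be expressed in quarter-note lengths so that 0.25 is a 16th note.

```python
from collections import Counter


def majority_share(durations):
    """Return the most frequent duration and its share of all tokens."""
    if not durations:
        raise ValueError("durations must not be empty")
    value, count = Counter(durations).most_common(1)[0]
    return value, count / len(durations)


# Hypothetical target-side durations, in quarter-note lengths.
example_durations = [0.25, 0.25, 0.25, 0.5]
most_common_duration, share = majority_share(example_durations)
```

If the share of the most frequent duration is high, a model that always outputs that duration already scores well, which would be consistent with the collapse to 16th notes observed in the experiment.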
4. Related Work
Most other music transfer work falls into two categories. In the first, researchers learn musical styles from samples in order to generate new, "random" songs in that style [1, 2, 3]. The authors also use musical knowledge to improve the quality of the generated pieces. In the second, researchers merge the style of one song's sound signal (in WAV format) into another song [4, 5, 6] using techniques like WaveNet. However, this approach requires a large amount of data as well as computing resources for training, while simpler transfer models produce very poor-quality output [5].
In our approach, we use some basic musical knowledge to turn the music transfer problem into a sequence translation problem. We do not rely on any detailed musical knowledge about Jazz; instead, we train a model to create a Jazz feeling. We believe our approach is the first of its kind.
5. Conclusions and Future Work
In this project, we have successfully transformed songs into their Jazz versions using chord translation: the chords in the original song are played with Jazz-style note combinations. We have also experimented with generating Jazz-style durations, but this requires more exploration. We note some directions that can be explored in the future:
Transpose the songs in the dataset into the same key (e.g., C major) to reduce the vocabulary size, which in turn effectively increases the amount of training data available for each chord.
Apply our approach to other types of instruments.
Fine-tune the parameters and try different algorithms/models to improve the results.
Divide each song into pieces of 2 or 4 bars, because the music within such a span (like a whole musical phrase) is usually related; this may make it easier for the model to learn how to play the chords.
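The key normalization proposed in the first direction above can be sketched as a simple pitch shift. This is a minimal sketch under assumptions: the tonic of each song is assumed to be already known (key detection is not shown), pitches are MIDI note numbers with 60 as middle C, and the shift is chosen in the range of -6 to +5 semitones to keep notes close to their original register. The names `PITCH_CLASSES` and `transpose_to_c` are illustrative.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def transpose_to_c(midi_pitches, tonic):
    """Shift MIDI pitches so that the given tonic maps onto C.

    The shift is the smallest move in semitones, chosen from -6..+5,
    so the transposed song stays near its original register.
    """
    shift = -PITCH_CLASSES.index(tonic)
    if shift < -6:
        shift += 12  # moving up is shorter than moving down
    return [pitch + shift for pitch in midi_pitches]


# A D major triad (D4, F#4, A4) moves down 2 semitones to C major (C4, E4, G4).
from_d = transpose_to_c([62, 66, 69], "D")

# A G major triad (G4, B4, D5) moves up 5 semitones to C major (C5, E5, G5).
from_g = transpose_to_c([67, 71, 74], "G")
```

After such a normalization, chords that play the same harmonic role in different keys map to the same vocabulary entry, which is the intended reduction in vocabulary size.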
Members' contributions
Duc Le: cleaned up the raw data to obtain chord and note sequences, and trained the NMT model on the collected data.
Luan Tran: extracted the raw representation of songs (notes, chords, tempo) from MIDI files, trained the NMT models, and regenerated MIDI output files by integrating the model’s output.
Vic Chen: collected Jazz MIDI files, separated melodies from background music, cleaned up the Jazz samples generated by the model, and trained and analyzed the models for the tempo experiments.
CSCI599 Deep Learning - Final Report
Music Style Transfer project
Team: VicDucLuan(Duc Le, Luan Tran, Vic Chen)
The full music sheets of songs and other transformed songs are available here
Result sample (combination of transformed piano with bass and harmonica): Imagine
References