MusicGen Chord is a modified version of Meta's MusicGen Melody model that can generate music from audio-based or text-based chord conditions.
You can demo this model or learn how to use it with Replicate's API here.
Cog is an open-source tool that packages machine learning models in a standard, production-ready container. You can deploy your packaged model to your own infrastructure, or to Replicate, where users can interact with it via web interface or API.
Cog. Follow these instructions to install Cog, or just run:

```
sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog
```

Note: to use Cog, you'll also need an installation of Docker.
Then, clone this repository:

```
git clone https://github.com/sakemin/cog-musicgen-chord
```
To run the model, you need a local copy of the model's Docker image. You can satisfy this requirement by specifying the image ID in your call to `cog predict`, like:

```
cog predict r8.im/sakemin/musicgen-chord@sha256:c940ab4308578237484f90f010b2b3871bf64008e95f26f4d567529ad019a3d6 -i prompt="k pop, cool synthwave, drum and bass with jersey club beats" -i duration=30 -i text_chords="C G A:min F" -i bpm=140 -i time_sig="4/4"
```
For more information, see the Cog section here.
Alternatively, you can build the image yourself, either by running `cog build` or by letting `cog predict` trigger the build process implicitly. For example, the following will trigger the build process and then execute prediction:

```
cog predict -i prompt="k pop, cool synthwave, drum and bass with jersey club beats" -i duration=30 -i text_chords="C G A:min F" -i bpm=140 -i time_sig="4/4"
```
Note: the first time you run `cog predict`, model weights and other requisite assets will be downloaded if they're not available locally. This download only needs to be executed once.
If you haven't already, you should ensure that your model runs locally with `cog predict`. This will guarantee that all assets are accessible. E.g., run:

```
cog predict -i prompt="k pop, cool synthwave, drum and bass with jersey club beats" -i duration=30 -i text_chords="C G A:min F" -i bpm=140 -i time_sig="4/4"
```
Go to replicate.com/create to create a Replicate model. If you want to keep the model private, make sure to specify "private".
Replicate supports running models on a variety of CPU and GPU configurations. For the best performance, you'll want to run this model on an A100 instance.
Click on the "Settings" tab on your model page, scroll down to "GPU hardware", and select "A100". Then click "Save".
Log in to Replicate:

```
cog login
```
Push the contents of your current directory to Replicate, using the model name you specified in step 1:

```
cog push r8.im/username/modelname
```
Learn more about pushing models to Replicate.
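Once pushed, the model can be run through Replicate's Python client. The snippet below is a minimal sketch: the model name `username/modelname` and the version hash are placeholders for your own pushed model, and it assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in your environment.

```python
import replicate

# Run a prediction against your pushed model.
# "username/modelname" and "your-version-hash" are placeholders;
# copy the real values from your model page on Replicate.
output = replicate.run(
    "username/modelname:your-version-hash",
    input={
        "prompt": "k pop, cool synthwave, drum and bass with jersey club beats",
        "duration": 30,
        "text_chords": "C G A:min F",
        "bpm": 140,
        "time_sig": "4/4",
    },
)
print(output)  # URL(s) of the generated audio
```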
- `prompt` (string): A description of the music you want to generate.
- `text_chords` (string): A text-based chord progression condition. A single uppercase letter (e.g. `C`) is considered a major chord. Chord attributes (`maj`, `min`, `dim`, `aug`, `min6`, `maj6`, `min7`, `minmaj7`, `maj7`, `7`, `dim7`, `hdim7`, `sus2` and `sus4`) can be added to the root letter after `:`. (e.g. `A:min7`) Each chord token separated by a space is allocated to a single bar. If more than one chord must be allocated to a single bar, cluster the chords with `,` and no space. (e.g. `C,C:7 G,E:min A:min`) You must choose only one of `audio_chords` (below) or `text_chords`.
- `bpm` (number): BPM condition for the generated output. `text_chords` will be processed based on this value. This will be appended at the end of `prompt`.
- `time_sig` (string): Time signature value for the generated output. `text_chords` will be processed based on this value. This will be appended at the end of `prompt`.
- `audio_chords` (file): An audio file that will condition the chord progression. You must choose only one of `audio_chords` or `text_chords`.
- `audio_start` (integer): Start time of the audio file to use for chord conditioning. (Default: 0)
- `audio_end` (integer): End time of the audio file to use for chord conditioning. If None, defaults to the end of the audio clip.
- `duration` (integer): Duration of the generated audio in seconds. (Default: 8)
- `continuation` (boolean): If `True`, the generated music will continue from `audio_chords`. With chord conditioning, this is only possible when the chord condition is given with `text_chords`. If `False`, the generated music will mimic the chords of `audio_chords`.
- `multi_band_diffusion` (boolean): If `True`, the EnCodec tokens will be decoded with MultiBand Diffusion.
- `normalization_strategy` (string): Strategy for normalizing audio. (Allowed values: `loudness`, `clip`, `peak`, `rms`. Default: `loudness`)
- `top_k` (integer): Reduces sampling to the k most likely tokens. (Default: 250)
- `top_p` (number): Reduces sampling to tokens with cumulative probability of p. When set to `0` (default), top-k sampling is used. (Default: 0)
- `temperature` (number): Controls the 'conservativeness' of the sampling process. Higher temperature means more diversity. (Default: 1)
- `classifier_free_guidance` (integer): Increases the influence of inputs on the output. Higher values produce lower-variance outputs that adhere more closely to the inputs. (Default: 3)
- `output_format` (string): Output format for generated audio. (Allowed values: `wav`, `mp3`. Default: `wav`)
- `seed` (integer): Seed for the random number generator. If `None` or `-1`, a random seed will be used.

The `text_chords` format follows this grammar:

```
<progression> ::= <bar> " " <bar>
<bar>         ::= <chord> "," <chord>
<chord>       ::= <note> ":" <shorthand>
<note>        ::= <natural> | <note> <modifier>
<natural>     ::= "A" | "B" | "C" | "D" | "E" | "F" | "G"
<modifier>    ::= "b" | "#"
<shorthand>   ::= "maj" | "min" | "dim" | "aug" | "maj7" | "min7" | "7" | "dim7" | "hdim7" | "minmaj7" | "maj6" | "min6" | "9" | "maj9" | "min9" | "sus4"
```
- A space is used as the split token; each split chunk is assigned to a single bar. (e.g. `C G E:min A:min`)
- Chords sharing a single bar are joined with `,`. (e.g. `C G,G:7 E:min,E:min7 A:min`)
- A single uppercase letter (e.g. `C`, `E`) is considered a major chord.
- The attributes `maj`, `min`, `dim`, `aug`, `min6`, `maj6`, `min7`, `minmaj7`, `maj7`, `7`, `dim7`, `hdim7`, `sus2` and `sus4` can be appended after `:`. (e.g. `E:dim`, `B:sus2`)
- Sharps and flats are notated with `#` and `b`. (e.g. `E#:min`, `Db`)
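To make the splitting rules concrete, here is a minimal parser sketch. It is not part of the model's code, just an illustration of how a progression string decomposes into bars and chords under the rules above.

```python
# Illustrative only: decompose a text chord progression into bars and chords
# (space = bar split, "," = chords within a bar, ":" = chord attribute).
def parse_progression(progression: str):
    bars = []
    for bar in progression.split(" "):      # each space-separated chunk is one bar
        chords = []
        for chord in bar.split(","):        # chords sharing a bar are comma-joined
            root, _, attribute = chord.partition(":")
            chords.append({
                "root": root,                    # e.g. "E#" or "Db"
                "attribute": attribute or "maj"  # bare uppercase letter = major chord
            })
        bars.append(chords)
    return bars

print(parse_progression("C G,G:7 E:min,E:min7 A:min"))
# [[{'root': 'C', 'attribute': 'maj'}],
#  [{'root': 'G', 'attribute': 'maj'}, {'root': 'G', 'attribute': '7'}], ...]
```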
When using `text_chords`, the `bpm` and `time_sig` values must be specified:

- `bpm` can be a float value. (e.g. `132`, `60`)
- `time_sig` is `(int)/(int)`. (e.g. `4/4`, `3/4`, `6/8`, `7/8`, `5/4`)
- The `bpm` and `time_sig` values will be automatically concatenated after the `prompt` description, so you don't need to specify BPM or time signature information in `prompt` yourself.
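These two values determine how long each bar (one space-separated chord token) lasts in the output. A rough back-of-the-envelope sketch, not the model's internal logic, assuming the time signature's numerator counts beats per bar at the given BPM:

```python
# Rough arithmetic: how many seconds one bar spans for a given bpm/time_sig,
# assuming the numerator is the number of beats per bar.
def seconds_per_bar(bpm: float, time_sig: str) -> float:
    beats_per_bar = int(time_sig.split("/")[0])
    return beats_per_bar * 60.0 / bpm

# At 140 bpm in 4/4, each bar lasts ~1.71 s, so a 30-second clip
# holds roughly 17 bars (about 17 space-separated chord tokens).
print(seconds_per_bar(140, "4/4"))        # ~1.714
print(30 / seconds_per_bar(140, "4/4"))   # ~17.5
```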
Audio chord conditioning:

- You can condition the chord progression with an audio file input, `audio_chords`.
- With the `audio_start` and `audio_end` values, you can specify which part of the `audio_chords` file input will be used as the chord condition.
- The chord progression is extracted from `audio_chords` using the BTC model.
- If `continuation` is `True`, the input audio file given at `audio_chords` will not be used as an audio chord condition; instead, the generated music output will continue from the given file. You can still use the `audio_start` and `audio_end` values to crop the input audio file.
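For instance, conditioning on the first 30 seconds of a local file might look like this with Replicate's Python client (the file name is hypothetical; the version hash is the one used elsewhere in this README and may have been updated since):

```python
import replicate

output = replicate.run(
    "sakemin/musicgen-chord:c940ab4308578237484f90f010b2b3871bf64008e95f26f4d567529ad019a3d6",
    input={
        "prompt": "cool synthwave",
        "audio_chords": open("my_song.wav", "rb"),  # local file to extract chords from
        "audio_start": 0,    # use the clip from 0 s ...
        "audio_end": 30,     # ... to 30 s as the chord condition
        "duration": 30,
    },
)
print(output)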
MusicGen Chord can generate music with `duration` longer than 30 seconds.

Assuming you have a local environment configured (i.e. you've completed the steps specified under Run with Cog), you can run training with a command like:
```
cog train -i dataset_path=@<path-to-your-data> <additional hyperparameters>
```
- If `drop_vocals` is set to `True`, the vocal tracks in the audio files will be isolated and removed. (Default: `True`) Setting `drop_vocals=False` reduces data preprocessing time and maintains audio file quality.
- Each audio file in the dataset should be paired with a `.txt` description file sharing the same name (e.g. `01_A_Man_Without_Love.mp3` and `01_A_Man_Without_Love.txt`).
- Alternatively, you can set the `one_same_description` argument to your desired description. In this case, there's no need for individual `.txt` files.
- If `auto_labeling` is set to `True`, labels such as 'genre', 'mood', 'theme', 'instrumentation', 'key', and 'bpm' will be generated and added to each audio file in the dataset. (Default: `True`)
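As a sketch of what a valid dataset layout might look like, the snippet below writes a same-named `.txt` description next to each audio file and zips the folder for upload. The folder name, file names, and description text are hypothetical; the zip format matches the `dataset_path` URL used in the training example further down.

```python
import os
import zipfile

dataset_dir = "my_dataset"  # hypothetical local folder of .mp3 files

# Pair each audio file with a same-named .txt description
# (skip this loop if you plan to pass one_same_description instead).
for name in os.listdir(dataset_dir):
    if name.endswith(".mp3"):
        txt_path = os.path.join(dataset_dir, name[:-4] + ".txt")
        if not os.path.exists(txt_path):
            with open(txt_path, "w") as f:
                f.write("upbeat synthwave with driving drums")

# Zip the folder so it can be passed as dataset_path.
with zipfile.ZipFile("dataset.zip", "w") as zf:
    for name in os.listdir(dataset_dir):
        zf.write(os.path.join(dataset_dir, name), arcname=name)
```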
Training inputs:

- `dataset_path`: Path = Input("Path to the dataset directory")
- `one_same_description`: str = Input(description="A description for all audio data", default=None)
- `auto_labeling`: bool = Input(description="Generate labels (genre, mood, theme, etc.) for each track using `essentia-tensorflow` for music information retrieval", default=True)
- `drop_vocals`: bool = Input(description="Remove vocal tracks from audio files using Demucs source separation", default=True)
- `lr`: float = Input(description="Learning rate", default=1)
- `epochs`: int = Input(description="Number of epochs to train for", default=10)
- `updates_per_epoch`: int = Input(description="Number of iterations for one epoch", default=100) (If None, iterations per epoch will be set according to dataset/batch size; if a value is provided, the number of iterations per epoch will be set as specified.)
- `batch_size`: int = Input(description="Batch size", default=3)
- With `epochs=3`, `updates_per_epoch=100`, and `lr=1`, the fine-tuning process takes approximately 15 minutes.
- `batch_size` must be a multiple of 8; otherwise, `batch_size` will be automatically set to the nearest multiple of 8.
- For the `chord` model, the maximum `batch_size` is `16` with the specified 8 x Nvidia A40 machine setting.

You can create a fine-tuning job with Replicate's Python client:

```python
import replicate

training = replicate.trainings.create(
    version="sakemin/musicgen-chord:c940ab4308578237484f90f010b2b3871bf64008e95f26f4d567529ad019a3d6",
    input={
        "dataset_path": "https://your/data/path.zip",
        "one_same_description": "description for your dataset music",
        "epochs": 3,
        "updates_per_epoch": 100,
    },
    destination="my-name/my-model"
)

print(training)
```
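The returned training object can be polled to follow progress. A brief sketch, assuming the same session as the example above:

```python
import time
import replicate

# Poll the training (from the example above) until it reaches a terminal state.
while training.status not in ("succeeded", "failed", "canceled"):
    time.sleep(30)
    training = replicate.trainings.get(training.id)

print(training.status)
```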
This model uses the following open-source tools:

- For auto labeling, `effnet-discogs` from MTG's `essentia` is used.
- `librosa` is used for audio analysis (e.g. 'bpm' and 'key' detection).
- Vocal track removal uses `demucs`.