ldzhangyx / instruct-MusicGen

The official implementation of our paper "Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning".
Apache License 2.0

Example structure for dataset #1

Closed Saltb0xApps closed 3 weeks ago

Saltb0xApps commented 1 month ago

Hey! Amazing work modifying MusicGen with instruct capabilities. I have a dataset of about 300k copyright-free audio files and I want to train the model from scratch.

I'm wondering whether training requires just instrumental tracks with descriptions, or whether we need to train on individual stems (possibly split from the instrumentals using Demucs).

It would be really helpful if you could share 2-5 examples that represent the quality and structure of the dataset used in the paper, like the ones in the audiocraft repo: https://github.com/facebookresearch/audiocraft/tree/main/dataset/example

ldzhangyx commented 1 month ago

Hi, thanks for using my code repo. I have already written dataloaders for the Slakh and MoisesDB datasets, which you can refer to. For a custom dataset, the dataloader returns each item in (input audio, output audio, instruction) format.
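To make the (input audio, output audio, instruction) format concrete, here is a minimal sketch of a map-style dataset that yields such triplets. The class name `PairedEditDataset` and the plain-list audio type are assumptions for illustration only; the actual Slakh and MoisesDB loaders in the repo may be organized differently and would typically return tensors.

```python
from typing import List, Sequence, Tuple

Audio = List[float]  # illustrative stand-in for a waveform tensor
Example = Tuple[Audio, Audio, str]  # (input audio, output audio, instruction)


class PairedEditDataset:
    """Hypothetical map-style dataset: index in, one editing triplet out."""

    def __init__(self, examples: Sequence[Example]) -> None:
        self.examples = list(examples)

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Example:
        input_audio, output_audio, instruction = self.examples[idx]
        return input_audio, output_audio, instruction
```

Because it only implements `__len__` and `__getitem__`, a class shaped like this can be wrapped by a standard batching loader once the audio is converted to tensors.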

ldzhangyx commented 1 month ago

As for the data: yes, this model requires paired data, which you can usually get from separated stems.

Saltb0xApps commented 1 month ago

So if I understand correctly, feeding in the same data that works for the base MusicGen model (instrumentals) will not work for this. We need a dataset that consists of:

  1. Base instrumental track
  2. Base instrumental track minus each stem (guitar/bass/drums/piano/others)
  3. Individual stems of each track

What structure works best for the descriptions? For example -

  1. "An energetic hip hop track with guitars, piano, and drums"
  2. "An energetic hip hop track"
  3. "An energetic hip hop track with guitars"

I will dive further into the MoisesDB and Slakh datasets to really understand the details! Thank you :)

ldzhangyx commented 1 month ago

Since instruct-MusicGen is a model for music editing, a data sample could be (mix without stem, mix with stem, "instruct: add [] stem. "), or (mix with stem, mix without stem, "instruct: remove [] stem. "), or (mix with stem, stem, "instruct: extract [] stem. ").
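The three sample types can be derived from one set of separated stems. The sketch below is only an illustration under some assumptions: audio is a plain list of samples, the mix is a simple sample-wise sum of equal-length stems, the `[]` placeholder is filled with the stem name, and the helper names `mix` and `make_triplets` are hypothetical rather than taken from the repo.

```python
from typing import Dict, List, Tuple

Audio = List[float]  # illustrative mono waveform
Triplet = Tuple[Audio, Audio, str]  # (input audio, output audio, instruction)


def mix(stems: List[Audio]) -> Audio:
    """Sum equal-length stems sample by sample (assumed mixing rule)."""
    return [sum(samples) for samples in zip(*stems)]


def make_triplets(stems: Dict[str, Audio], target: str) -> List[Triplet]:
    """Build add / remove / extract examples for one target stem.

    Expects at least two stems so that the mix without the target is non-empty.
    """
    full = mix(list(stems.values()))
    rest = mix([audio for name, audio in stems.items() if name != target])
    return [
        (rest, full, f"instruct: add {target} stem. "),
        (full, rest, f"instruct: remove {target} stem. "),
        (full, stems[target], f"instruct: extract {target} stem. "),
    ]
```

Calling `make_triplets` once per stem name in a track yields three training examples per stem from the same source material.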

The original paper does not use any specific description for the stem, but I think it is okay to add a description to the instruction. You can try "An energetic hip hop track with guitars, piano, and drums. instruct: extract [] stem. ", "An energetic hip hop track. instruct: extract energetic [] stem. ", or "Music. instruct: extract energetic [] stem. ".
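These variants differ only in an optional description prefix and an optional adjective before the stem name, so they can be generated by one small helper. This is a sketch, not code from the repo: `build_instruction` and its parameter names are hypothetical, and filling `[]` with a stem name is an assumption.

```python
def build_instruction(action: str, stem: str,
                      description: str = "", adjective: str = "") -> str:
    """Compose an instruction string, optionally prefixed by a description.

    The trailing space mirrors the example strings given in this thread.
    """
    target = f"{adjective} {stem}".strip()  # e.g. "energetic guitar"
    command = f"instruct: {action} {target} stem. "
    prefix = description.strip()
    return f"{prefix} {command}" if prefix else command
```

For example, `build_instruction("extract", "guitar", "Music.", "energetic")` gives the third variant with `guitar` as the stem, which makes it easy to train with each prompt style and compare.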