How can I finetune on my own data?

viewfinder-annn commented 8 months ago

Hi thanks for the great work!

I wonder how I can finetune your chat checkpoint on my own dataset. Are there any resources I can refer to, such as dataset preparation and a finetuning script? Thanks!

hf-lin commented 8 months ago

Thank you for your interest. Please follow the steps below, which reflect our implementation.

Prepare the SFT data. Each record in our dataset has three main fields: instruction, input, and output. Together they form a conversation of the form Human: {...} </s> Assistant: {...} </s>. For example:

  {
    "instruction": "Construct melodies by blending the designated musical pattern with the supplied motif.",
    "input": "['Binary', 'Sectional: Verse/Chorus'];X:1 L:1/16 M:2/4 K:G ['G2BG A2cA B2dB', '(gf)(ge) (ed)(cB)' </s> ",
    "output": "Assistant: X:1 L:1/16 M:2/4 K:G G2BG A2cA | B2dB G2B2 | c2ec B2dB | ABAG (GF)(ED) | G2BG A2cA | B2dB c2ec | cBAG D2f2 | g2d2B2G2 || (gf)(ge) (ed)(cB) | (gf)(ge) (ed)(cB) | ca2c Bg2B | ABAG GFED | G2BG A2cA | cBAG d2f2 | g2d2B2G2 || </s> "
  }

Check our MusicPile-sft for all samples. We recommend constructing your data in a format consistent with ours for finetuning based on ChatMusician-Base.
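
A quick sanity check before preprocessing can catch malformed records. The sketch below is not part of the repository: `check_record` is a hypothetical helper, and it assumes the three fields are concatenated in order, so that the full text holds one `</s>` after the human turn and one after the assistant turn.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")
EOS = "</s>"


def check_record(record):
    """Return a list of problems found in one SFT record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"missing or non-string field: {field}")
    if problems:
        return problems
    # Assumption: the fields are joined in this order into one conversation.
    text = record["instruction"] + record["input"] + record["output"]
    if text.count(EOS) != 2:
        problems.append(f"expected 2 '{EOS}' markers, found {text.count(EOS)}")
    if not record["output"].rstrip().endswith(EOS):
        problems.append("output does not end with the end-of-turn marker")
    return problems
```

Running it over every record and printing the non-empty results gives a short list of entries to fix by hand.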

Preprocess the data with the script model/train/data_preprocess.py:
```
python model/train/data_preprocess.py \
    -t $TOKENIZER_PATH \
    -i $DATA_FILE \
    -o $OUTPUT_DIR \
    --tokenize_fn sft
```
This script converts the text into token IDs ahead of time, which saves GPU memory compared with tokenizing at runtime.

Run the training script: model/train/scripts/train.sh ${PREPROCESSED_DATASET_PATH} m-a-p/ChatMusician-Base. We finetuned on a single machine with 8 V100 GPUs. You can modify the script to suit your needs.

petergreis commented 8 months ago

While the mechanics are clear, the data format has me a bit confused. If I take the example of a Mozart piece in ABC format, what should the output section be? The input could be something like 'Mozart', 'K219', 'X:1…..'. I would like to feed the model more classical references, as the training bias is massively tilted towards Irish music. Can you offer any insight here?

hf-lin commented 8 months ago

The output section depends on your needs. We designed eight tasks in our dataset, covering music generation and music understanding (see the paper for more details). One of them is to write songs imitating Bach's style: the input is a natural-language command asking for an imitation of Bach, and the output is the ABC notation of a Bach work. Here is one record:

{
  "instruction": "Human: Write a song that has the characteristics of Bach's musical compositions. </s> ",
  "input": "",
  "output": "X:1
%%score 1 2 3 4
L:1/4
M:4/4
K:C
V:1 treble 
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:2 treble 
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:3 bass 
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:4 bass 
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:1
 z3 C/D/ | E F G G | F E !fermata!D G | A B c B | A2 !fermata!G C/D/ | E F G G | F E !fermata!D G | %7
 A B c B | A2 !fermata!G G | c B A G | F E !fermata!D G | F E D/E/ F | E D !fermata!C z |] %13
V:2
 z3 G, | C C D C/B,/ | A,/B,/ C !fermata!B, D | E/^F/ G A G | G ^F !fermata!D G, | C C D C/B,/ | %6
 A,/B,/ C !fermata!B, D | E/^F/ G A G | G ^F !fermata!D D | E D C/D/ E | %10
 D G,/A,/ !fermata!B, E/D/ | C/ D C B,/ C | C B, !fermata!G, z |] %13
V:3
 z3 E,/F,/ | G, F,/E,/ D, E, | F, G, !fermata!G, G, | C D D D | E D/C/ !fermata!B, E,/F,/ | %5
 G, F,/E,/ D, E, | F, G, !fermata!G, G, | C D D D | E D/C/ !fermata!B, B,/A,/ | %9
 G,/E,/ F,/G,/ A, A, | A,/B,/ C !fermata!G, B, | A,/G,/ G, G, F, | G,3/2 F,/ !fermata!E, z |] %13
V:4
 z3"C" C, |"C" C,/B,,/"F/A" A,,"G/B" B,,"C" C, |"Dm" D,"C/E" E,/F,/"G" !fermata!G,"G" B, | %3
"Am" A,"G" G,"D7/F#" ^F,"G" G, |"Am7/C" C,"D" D,"G" !fermata!G,,"C" C, | %5
"C" C,/B,,/"F/A" A,,"G/B" B,,"C" C, |"Dm" D,"C/E" E,/F,/"G" !fermata!G,"G" B, | %7
"Am" A,"G" G,"D7/F#" ^F,"G" G, |"Am7/C" C,"D" D,"G" !fermata!G,,"G" G,/F,/ | %9
"C/E" E,/C,/"Bdim/D" D,/E,/"F" F,/E,/"A7/C#" D,/^C,/ |"Dm" D,"C/E" E,/^F,/"G" !fermata!G,"Em" E, | %11
"F/A" A,,/B,,/"C" C,"Gsus4" G,,"F/A" A,, |"C/G" G,,/F,,/"G" G,,"C" !fermata!C,, z |] %13 </s> "
}
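
For a corpus of another composer, a record with the same layout could be built as sketched below. This is an illustration, not the project's tooling: `make_generation_record`, the Mozart instruction wording, and the short ABC snippet are all made up here. It copies the layout of the Bach record (a `Human:` prefix on the instruction, an empty input, and ` </s> ` closing each turn); the two records shown in this thread place the `Human:` and `Assistant:` prefixes differently, so check MusicPile-sft for the task you are imitating. The record above is displayed with literal line breaks, whereas `json.dumps` escapes them as `\n`, which strict JSON requires.

```python
import json


def make_generation_record(instruction, abc_text):
    """Build one record laid out like the Bach example in this thread."""
    return {
        "instruction": f"Human: {instruction.strip()} </s> ",
        "input": "",  # left empty, as in the Bach example
        "output": f"{abc_text.strip()} </s> ",
    }


# Hypothetical instruction and a toy ABC tune, for illustration only.
abc = "X:1\nL:1/4\nM:4/4\nK:C\nC D E F | G2 G2 |]"
record = make_generation_record("Write a song in the style of Mozart.", abc)

# Line breaks inside the ABC text are escaped, so the result is valid JSON.
serialized = json.dumps(record, ensure_ascii=False)
```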

petergreis commented 7 months ago

A further refinement on my question - so given your example above:

"instruction": "Human: Write a song that has the characteristics of Bach's musical compositions." Is this then the same for all Bach pieces that are offered up for fine-tuning? I have a reasonable corpus of Mozart gathered, transposed, and converted to ABC format. Should there not be something more in the input field, like ["Mozart", "K219"]? Just trying to get the best result possible here...

hf-lin commented 7 months ago

Not every record has the same instruction field; in our dataset we rewrote each instruction into several different phrasings. The input may contain more specific information about the composition.
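
One way to get that variety in your own data is to sample the instruction from a small pool of paraphrases, as in the sketch below. Only the first phrasing comes from this thread; the other two are invented for illustration, and `assign_instructions` is a hypothetical helper rather than anything in the repository. The record layout again follows the Bach example.

```python
import random

INSTRUCTION_VARIANTS = [
    # The first entry is the wording quoted in this thread;
    # the others are made-up paraphrases.
    "Write a song that has the characteristics of Bach's musical compositions.",
    "Compose a piece in the style of Bach.",
    "Create a tune that sounds like it was written by Bach.",
]


def assign_instructions(abc_pieces, variants=INSTRUCTION_VARIANTS, seed=0):
    """Pair each ABC piece with a randomly chosen instruction phrasing."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    return [
        {
            "instruction": f"Human: {rng.choice(variants)} </s> ",
            "input": "",
            "output": f"{abc.strip()} </s> ",
        }
        for abc in abc_pieces
    ]
```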

petergreis commented 7 months ago

Just a quick update in case anyone cares: I have updated the preprocess script to also accept a CSV file for tokenisation. I will make it available for pull once it's cleaned up a bit.
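
Until that is available, a CSV can be turned into records with the standard library alone. The sketch below is not the updated script mentioned above; `records_from_csv` is a hypothetical helper, and it assumes the CSV has `instruction`, `input`, and `output` columns matching the dataset fields.

```python
import csv
import io


def records_from_csv(csv_text):
    """Parse CSV text with instruction/input/output columns into records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(
            {
                "instruction": row["instruction"],
                "input": row.get("input") or "",  # treat a missing input as empty
                "output": row["output"],
            }
        )
    return records
```

Quoted CSV fields may span several lines, so multi-line ABC notation survives the round trip.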

petergreis commented 7 months ago

I noticed that the input field needs to be blank; otherwise the fine-tuned model produced crazy hallucinations.

hf-lin / ChatMusician

How can I finetune on my own data? #4