gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0
1.46k stars 146 forks source link

How to format dataset fields in model prompt? #63

Closed dangbert closed 2 months ago

dangbert commented 2 months ago

Hi I'm looking to finetune an LLM using this dataset, and was wondering if there's any advice on how to format the prompt given the instruction vs input fields?

For example consider these entries:

  {
    "output":"The author has used personification in the sentence \"The cold breeze chills my bones.\" Personification is a figure of speech in which a non-human subject is given human characteristics. In this case, the non-human subject is the cold breeze, which is given the human characteristic of being able to chill someone's bones.",
    "input":"The cold breeze chills my bones.",
    "instruction":"Identify a stylistic device used by the author in the following sentence."
  }

 {
    "output":"Two players from the Kansas City Chiefs team are Patrick Mahomes and Tyreek Hill.",
    "input":"",
    "instruction":"Name two players from the Chiefs team?"
  }

I imagine two approaches:

  1. Use the "instruction" as the system prompt, and the "input" as the first user chat message (which would often be empty though)...
  2. Concatenate the instruction + input fields into a single (first) user chat message.

I think I'll use approach 2 but would appreciate any insights or references on this topic :)

dangbert commented 2 months ago

I think this file mostly answers my question https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py#L31 permalink

and this file in torchtune is interesting as well (references the above link) https://github.com/pytorch/torchtune/blob/main/torchtune/datasets/_alpaca.py permalink