orionw / FollowIR

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
https://arxiv.org/abs/2403.15246

Training template #8

Open ZZZYYYLL opened 1 week ago

ZZZYYYLL commented 1 week ago

Hi, thanks for the great work. I have a question about how to transform the training dataset into the llama_factory format.

I'd like to ask for advice on how to properly construct the training data format for llama_factory fine-tuning. I found FollowIR-7B's training set on Hugging Face, and the format is as follows:

{
  "score": "the score from Mistral-Instruct-7B-v0.2 of whether it was relevant or not (1 is relevant, 0 is not)",
  "label": "the label of relevance from GPT-3.5-Turbo-1106 who created the document",
  "id": "the id from the original TREC track and the file it came from",
  "document": "the synthetic document produced by GPT-3.5-Turbo-1106 given the original instruction, query, and label",
  "query": "the query written by TREC",
  "instruction": "the instruction (or narrative) written by TREC for human annotation"
}

To fit llama_factory's format, should the data I build for fine-tuning look like this:

{
  "instruction": "<s> [INST] You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.\n",
  "input": "Query: {query}  {instruction}\n Document: {document}\n Relevant (only output one word, either \"true\" or \"false\"): [/INST]",
  "output": "{label}"
}
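
For context, here is the rough conversion script I have drafted (untested; the Hugging Face dataset ID and the exact field handling are my guesses based on the schema above):

import json

from datasets import load_dataset

# Dataset ID is a guess -- substitute the actual FollowIR training set
# from the model card if it differs.
ds = load_dataset("jhu-clsp/FollowIR-train", split="train")

# System-style prefix from the prompt format above, Mistral tokens included.
SYSTEM = (
    "<s> [INST] You are an expert Google searcher, whose job is to determine "
    "if the following document is relevant to the query (true/false). "
    "Answer using only one word, one of those two choices.\n"
)

records = []
for ex in ds:
    records.append({
        "instruction": SYSTEM,
        "input": (
            f"Query: {ex['query']}  {ex['instruction']}\n"
            f" Document: {ex['document']}\n"
            ' Relevant (only output one word, either "true" or "false"): [/INST]'
        ),
        "output": ex["label"],
    })

# llama_factory's alpaca-style data is a JSON list of
# {instruction, input, output} records.
with open("followir_train.json", "w") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)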

I would appreciate it if you could give me an example.

orionw commented 1 week ago

Thanks for the interest! Here's an example: https://github.com/orionw/FollowIR/issues/5#issuecomment-2330372098

Your format looks correct offhand, but I would probably do a diff to be certain. EDIT: ah, I think you're adding the Mistral tokens, but llama_factory adds those via the --template flag. There are probably a few other small differences like that between the two.
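
If you let llama_factory's Mistral template add the special tokens for you, the records would look more like this (a sketch, not checked against llama_factory's exact template output):

{
  "instruction": "You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.\n",
  "input": "Query: {query}  {instruction}\n Document: {document}\n Relevant (only output one word, either \"true\" or \"false\"):",
  "output": "{label}"
}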