ml-explore / mlx-examples

Examples in the MLX framework
MIT License

What's a good data format for lora fine-tuning? #258

Open gladjoyhub opened 7 months ago

gladjoyhub commented 7 months ago

I know the recommended format is this: {"text": "Q:What is the capital of France?\nA:The capital of France is Paris."}

But some base models, like Solar 10.7B, recommend: .### User: What's the meaning of the word 'duck' in the following context?

.### Context: The anthology of classic games is replete with instances where a skillful duck altered the trajectory of bridge history.

.### Assistant: A skill in the game of bridge.

How should I adapt the format? Thanks!

awni commented 7 months ago

You can put whatever you want in the "text" field of the JSON. In your case you could simply do:

{"text": ".### User:\nWhat's the meaning of the word 'duck' in the following context?\n\n.### Context:\n\nThe anthology of classic games is replete with instances where a skillful duck altered the trajectory of bridge history.\n\n.### Assistant:\nA skill in the game of bridge."}

The main thing is consistency. So if you use that format for LoRA training you should use the identical format for testing and generation.
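
To illustrate the consistency point, here is a minimal sketch that routes both the training records and the generation prompt through one template. The template is loosely modeled on the format quoted above, and the names `TEMPLATE`, `format_prompt`, `format_example`, and `to_jsonl_line` are hypothetical helpers for illustration, not part of mlx-examples.

```python
import json

# Assumed template, loosely modeled on the Solar-style format quoted above.
TEMPLATE = (
    ".### User:\n{question}\n\n"
    ".### Context:\n{context}\n\n"
    ".### Assistant:\n"
)

def format_prompt(question: str, context: str) -> str:
    """Prompt used at generation time: everything up to the answer."""
    return TEMPLATE.format(question=question, context=context)

def format_example(question: str, context: str, answer: str) -> str:
    """Training text: the same prompt followed by the answer."""
    return format_prompt(question, context) + answer

def to_jsonl_line(question: str, context: str, answer: str) -> str:
    """One JSON object per line, with the formatted text in the "text" field."""
    return json.dumps({"text": format_example(question, context, answer)})
```

Because the training text is built from `format_prompt`, a prompt passed at generation time matches the prefix the model saw during LoRA training.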

USMCM1A1 commented 7 months ago

Edit: I shared raw text without instruction formatting by mistake. I'm using a Mistral base model, which marks instructions with [INST] and [/INST], so I used the following for each line:

{"text": "<s>[INST] Q: \"What significant action did Marcus Aurelius take during his stay at Athens, and how did this reflect his priorities as Emperor?[/INST]\"\nA: During his stay at Athens, Marcus Aurelius took the significant action of endowing professorships in each of the great schools of doctrine, which included the Platonic, Aristotelian, Stoic, and Epicurean schools. This action demonstrated his priorities as Emperor in several ways.\n\nFirstly, by supporting these philosophical schools, Marcus Aurelius emphasized the importance of education and intellectual pursuits. He recognized that the pursuit of knowledge and wisdom was essential for personal growth and good governance. As a philosopher himself, he believed in the power of philosophy to shape character and inform sound decision-making.\n\nSecondly, by providing resources to all the major schools of thought, Marcus Aurelius showed his commitment to fostering a pluralistic intellectual environment. He did not favor one particular philosophical tradition but instead promoted a diversity of ideas, encouraging free thinking and open discussion. This approach reflected his belief in the value of critical inquiry and reasoned debate as cornerstones of a thriving society.\n\nLastly, endowing these professorships during his stay at Athens, an important center of learning, highlighted Marcus Aurelius' appreciation for cultural heritage and the preservation of knowledge. By investing in these institutions, he aimed to ensure their continued success and contribute to the ongoing development of human thought.\n\nIn summary, by endowing professorships in various philosophical schools during his stay at Athens, Marcus Aurelius demonstrated his priorities as Emperor in promoting education, intellectual diversity, and cultural heritage.</s>"}

chimezie commented 7 months ago

When training on a raw corpus, I use the raw text, per lora.py. For instruction datasets, I wrap the input in the Mistral prompt format before tokenizing but leave the output as raw text:

INPUT: [INST] .. instruction ..[/INST] OUTPUT: .. raw text ..
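
A minimal sketch of that scheme, assuming (instruction, output) string pairs as input. The helper names `mistral_instruct_text` and `to_jsonl` are hypothetical, and the exact spacing around the [INST] markers is an assumption; check it against the prompt format documented for your model.

```python
import json

def mistral_instruct_text(instruction: str, output: str) -> str:
    """Wrap only the instruction in [INST] ... [/INST]; keep the output raw."""
    # Spacing around the markers is an assumption, not a confirmed spec.
    return f"[INST] {instruction.strip()} [/INST] {output.strip()}"

def to_jsonl(pairs) -> str:
    """Serialize (instruction, output) pairs as one JSON object per line."""
    return "".join(
        json.dumps({"text": mistral_instruct_text(instruction, output)}) + "\n"
        for instruction, output in pairs
    )
```

The returned string can be written straight to a .jsonl training file. No `<s>` or `</s>` markers are added here, on the assumption that the tokenizer handles them.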

wang-junjian commented 7 months ago

I modified the script mlx-examples/lora/data/wikisql.py

if __name__ == "__main__":
    # ......
    for dataset, name, size in datasets:
        with open(f"data/{name}.jsonl", "w") as fid:
            for e, t in zip(range(size), dataset):
                """
                The text in the variable t looks like this:
                ------------------------
                <s>table: 1-1058787-1
                columns: Approximate Age, Virtues, Psycho Social Crisis, Significant Relationship, Existential Question [ not in citation given ], Examples
                Q: How many significant relationships list Will as a virtue?
                A: SELECT COUNT Significant Relationship FROM 1-1058787-1 WHERE Virtues = 'Will'</s>                
                """
                t = t[3:]  # Strip the leading <s>, because the tokenizer adds <s> automatically
                json.dump({"text": t}, fid)
                fid.write("\n")
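
The slice `t[3:]` above drops the first three characters unconditionally, so it would also truncate a record that does not start with `<s>`. A slightly more defensive variant might look like the sketch below. The helper name `strip_bos` is hypothetical, and `"<s>"` as the beginning-of-sequence string is an assumption that depends on the tokenizer.

```python
BOS = "<s>"  # assumed BOS string; check what your tokenizer actually prepends

def strip_bos(text: str, bos: str = BOS) -> str:
    """Remove one leading BOS marker if present; otherwise return text unchanged."""
    return text[len(bos):] if text.startswith(bos) else text
```

In the loop above this would replace the slice with `t = strip_bos(t)`.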

The results after fine-tuning are pretty good.

python -m mlx_lm.generate --model lora_fused_model \
                          --max-tokens 50 \
                          --prompt "table: students
columns: Name, Age, School, Grade, Height, Weight
Q: Query Wang Junjian’s name, age, and school information.
A: "
SELECT Name, Age, School FROM Students WHERE Name = 'Wang Junjian'