unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Dataset for training a model to translate language #1303

Open nichellehouston opened 4 days ago

nichellehouston commented 4 days ago

I want to use https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing to train a model for language translation with this dataset: https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-fr. How can I use it in the notebook's Data Prep section?

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
```

Erland366 commented 3 days ago

For me, I'd set the instruction to "Translate this from English to French" (hardcoded), take the inputs from examples["translation"]["en"], and take the outputs from examples["translation"]["fr"], or vice versa, depending on which language you want as the source and which as the destination.
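If it helps, here is a minimal sketch of how that could look for the opus-100 en-fr config, reusing the alpaca_prompt, EOS_TOKEN, and tokenizer from the notebook code above. Note that with batched = True, examples["translation"] arrives as a list of {"en": ..., "fr": ...} dicts, so the sketch iterates over it rather than indexing it directly:

```python
from datasets import load_dataset

# opus-100 needs a language-pair config; each row is {"translation": {"en": "...", "fr": "..."}}
dataset = load_dataset("Helsinki-NLP/opus-100", "en-fr", split = "train")

def formatting_prompts_func(examples):
    texts = []
    # With batched = True, examples["translation"] is a list of dicts
    for pair in examples["translation"]:
        text = alpaca_prompt.format(
            "Translate this from English to French",  # hardcoded instruction
            pair["en"],                                # input: source sentence
            pair["fr"],                                # output: target sentence
        ) + EOS_TOKEN  # Must add EOS_TOKEN, otherwise generation goes on forever
        texts.append(text)
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)
```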

Then, at inference time, make sure you use the same template as well.
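A sketch of the matching inference call, assuming model and tokenizer are the ones loaded earlier in the notebook; the Response slot is left empty so the model generates the translation:

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch the model into inference mode

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Translate this from English to French",  # same hardcoded instruction as training
            "How are you today?",                      # sentence to translate
            "",                                        # leave blank for the model to fill in
        )
    ],
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True))
```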