unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

formatting_prompts_func not used on wikipedia dataset on continued pretraining #740

Open x1250 opened 3 weeks ago

x1250 commented 3 weeks ago

Hello guys, I noticed that in the continued pretraining Colab for Korean, the function formatting_prompts_func is never used to map the Wikipedia dataset:

from datasets import load_dataset

def formatting_prompts_func(examples):
    # Function definition...

dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)
dataset = dataset.train_test_split(train_size = 0.01)["train"]

But later, in the Alpaca dataset finetuning section, the function is defined again, and this time it is actually used:

def formatting_prompts_func(conversations):
    # Function definition...

alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-korean", split = "train")
alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

Is this the intended behavior or a bug? I just want to make sure I'm doing things the right way. Should the function also be mapped over the first dataset, or is it not needed?

Thank you.

danielhanchen commented 3 weeks ago

Whoops my bad - adding it in!!
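
For anyone who hits this before the notebook is updated, here is a rough sketch of what mapping the Wikipedia dataset could look like. It assumes a minimal formatting function that only appends the tokenizer's EOS token to each article; the actual notebook version (and its prompt template) may differ.

# Hypothetical sketch, not the notebook's exact code.
from datasets import load_dataset

EOS_TOKEN = tokenizer.eos_token  # assumes `tokenizer` was already loaded via FastLanguageModel.from_pretrained

def formatting_prompts_func(examples):
    # Batched map: examples["text"] is a list of article strings.
    # Appending EOS marks document boundaries for continued pretraining.
    return {"text": [text + EOS_TOKEN for text in examples["text"]]}

dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)
dataset = dataset.train_test_split(train_size = 0.01)["train"]
dataset = dataset.map(formatting_prompts_func, batched = True,)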