unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

formatting_prompts_func not used on wikipedia dataset on continued pretraining #740

Open x1250 opened 3 weeks ago

x1250 commented 3 weeks ago

Hello guys, I noticed that in the continued pretraining Colab for Korean, the function formatting_prompts_func is never used to map the Wikipedia dataset:

from datasets import load_dataset

def formatting_prompts_func(examples):
    # Function definition...

dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)
dataset = dataset.train_test_split(train_size = 0.01)["train"]

But later, in the Alpaca dataset finetuning section, the function is defined again, and this time it is actually used:

def formatting_prompts_func(conversations):
    # Function definition...

alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-korean", split = "train")
alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

Is this the intended behavior or a bug? I just want to make sure I'm doing things the right way. Should the function also be mapped over the first dataset, or is it not needed?

Thank you.

danielhanchen commented 3 weeks ago

Whoops my bad - adding it in!!
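
For anyone who hits this before the notebook is updated, here is a rough sketch of what mapping the Wikipedia dataset could look like. It assumes a minimal formatting function that only appends the tokenizer's EOS token to each article; the actual notebook version (and its prompt template) may differ.

# Hypothetical sketch, not the notebook's exact code.
from datasets import load_dataset

EOS_TOKEN = tokenizer.eos_token  # assumes `tokenizer` was already loaded via FastLanguageModel.from_pretrained

def formatting_prompts_func(examples):
    # Batched map: examples["text"] is a list of article strings.
    # Appending EOS marks document boundaries for continued pretraining.
    return {"text": [text + EOS_TOKEN for text in examples["text"]]}

dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)
dataset = dataset.train_test_split(train_size = 0.01)["train"]
dataset = dataset.map(formatting_prompts_func, batched = True,)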