unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
16.46k stars 1.14k forks

continued pretrain #683

Open · nkyc-no-name opened 3 months ago

nkyc-no-name commented 3 months ago

Hi,

I'm working on the "Continued pretraining - Korean + Unsloth.ipynb" notebook with Llama3_8b. While preparing the data for pretraining, the EOS_TOKEN doesn't seem to actually get applied to the loaded wikipedia dataset, even though appending it is defined in formatting_prompts_func. Please let me know whether this is a mistake or deliberate.
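
For reference, this is roughly what I expected the data preparation step to do (a minimal sketch; the model name and the wikipedia dataset identifier below are placeholders for whatever the notebook actually loads):

```python
from unsloth import FastLanguageModel
from datasets import load_dataset

# Placeholder model name; the notebook loads its own Llama-3 8B checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    # Append the EOS token to every article so each example ends with an
    # explicit end-of-sequence marker during continued pretraining.
    return {"text": [text + EOS_TOKEN for text in examples["text"]]}

# Placeholder dataset identifier for the Korean wikipedia dump.
dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
```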

Also, for the instruction finetuning in the same notebook, can I use the SFTTrainer that is normally used for finetuning instead of UnslothTrainer? Is there any difference?
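
Concretely, the swap I have in mind looks like this (a rough sketch following the notebook's pattern, reusing model, tokenizer, and dataset from above; the hyperparameter values are just placeholders):

```python
# What the notebook uses: UnslothTrainer with UnslothTrainingArguments.
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        max_steps=120,
        learning_rate=5e-5,
        # UnslothTrainingArguments adds a separate (usually smaller) learning
        # rate for the embedding matrices and lm_head:
        embedding_learning_rate=5e-6,
        output_dir="outputs",
    ),
)

# Versus the plain TRL trainer I normally use for finetuning, which takes the
# same arguments but has no embedding_learning_rate:
# from trl import SFTTrainer
# from transformers import TrainingArguments
# trainer = SFTTrainer(..., args=TrainingArguments(...))
```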

Thanks in advance!

danielhanchen commented 3 months ago

Oh good point - so sorry for the delay as well - I relocated to SF, hence the slowness. I shall edit the notebooks to add an EOS token.