jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

A guide to adding more datasets #22

Closed · VatsaDev closed this 1 year ago

VatsaDev commented 1 year ago

One of the requirements is:

I was looking through prepare_slimpajama.py, and from what I can tell:

When I tried to look into the packed dataset, I noticed it's supposed to be a custom-format dataset?

I think it would be very useful if you made a guide on preparing a dataset, maybe with an example of a small dataset on Colab, because most of our PCs can't handle the sheer file size of the tokens in the SlimPajama and StarCoder datasets.
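
For context, my rough understanding of the packed format is that the prepare scripts tokenize the raw text and concatenate the token IDs into fixed-size binary chunks that the training loader can memory-map instead of re-tokenizing. A sketch of that idea (the function name, chunk naming, and dtype here are illustrative; the repo's actual on-disk layout comes from lit-gpt's `PackedDatasetBuilder` and differs in detail):

```python
import numpy as np

def pack_documents(token_id_lists, chunk_size, sep_token, out_prefix):
    """Concatenate tokenized documents into fixed-size .bin chunks.

    Illustrative sketch only; prepare_slimpajama.py uses lit-gpt's
    PackedDatasetBuilder, whose real on-disk format is more involved.
    """
    buffer, n_chunks = [], 0
    for ids in token_id_lists:
        buffer.extend(ids)
        buffer.append(sep_token)  # separator token between documents
        while len(buffer) >= chunk_size:
            # uint16 is enough for a 32k Llama vocabulary
            chunk = np.array(buffer[:chunk_size], dtype=np.uint16)
            chunk.tofile(f"{out_prefix}_{n_chunks:06d}.bin")
            buffer = buffer[chunk_size:]
            n_chunks += 1
    return n_chunks
```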

jzhang38 commented 1 year ago

Normally you would want to finetune on Colab rather than pretrain. For that, I recommend checking out QLoRA.
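
A minimal QLoRA setup with `transformers` + `peft` + `bitsandbytes` looks roughly like the sketch below; the checkpoint name and LoRA hyperparameters are placeholders, not a tested recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Placeholder checkpoint name; substitute the TinyLlama checkpoint you want to tune.
base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit NF4 keeps a 1.1B model well within Colab VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```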

VatsaDev commented 1 year ago

Sorry, I did mean adding a dataset for SFT; I only quoted the requirement because I believed datasets would be loaded the same way.

I can see multiple different dataset options being loaded in the SFT script; can it be called the same way for any custom dataset? Also, what's the ETA for the finetuning time?
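
For example, something like the following is what I'd hope would work, swapping in a local JSONL file via HF `datasets`; the field names are just a guess at the schema the script expects:

```python
from datasets import load_dataset

# Load a local JSONL file instead of a hub dataset; each line is one example,
# e.g. {"instruction": "...", "output": "..."}. The field names the SFT script
# expects are an assumption here; check its preprocessing code.
dataset = load_dataset("json", data_files={"train": "my_sft_data.jsonl"})["train"]

def to_prompt(example):
    # Map custom fields onto a single training text; adjust to the script's format.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = dataset.map(to_prompt)
```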

jzhang38 commented 1 year ago

It took ~1 hour on 8 A40s.