jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

A guide to adding more datasets #22

Closed · VatsaDev closed this 1 year ago

VatsaDev commented 1 year ago

One of the requirements is:

I was looking through prepare_slimpajama.py, and from what I can tell:

When I tried to look into the packed dataset, I noticed it's supposed to be a custom-format dataset?

I think it would be very useful if you made a guide on preparing a dataset, maybe with an example of a small dataset on Colab, because most of our PCs can't handle the sheer file size of the tokens in the SlimPajama and StarCoder datasets.
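
For context, my rough understanding of the packed format is that the prepare scripts tokenize the raw text and concatenate the token IDs into fixed-size binary chunks that the training loader can memory-map instead of re-tokenizing. A sketch of that idea (the function name, chunk naming, and dtype here are illustrative; the repo's actual on-disk layout comes from lit-gpt's `PackedDatasetBuilder` and differs in detail):

```python
import numpy as np

def pack_documents(token_id_lists, chunk_size, sep_token, out_prefix):
    """Concatenate tokenized documents into fixed-size .bin chunks.

    Illustrative sketch only; prepare_slimpajama.py uses lit-gpt's
    PackedDatasetBuilder, whose real on-disk format is more involved.
    """
    buffer, n_chunks = [], 0
    for ids in token_id_lists:
        buffer.extend(ids)
        buffer.append(sep_token)  # separator token between documents
        while len(buffer) >= chunk_size:
            # uint16 is enough for a 32k Llama vocabulary
            chunk = np.array(buffer[:chunk_size], dtype=np.uint16)
            chunk.tofile(f"{out_prefix}_{n_chunks:06d}.bin")
            buffer = buffer[chunk_size:]
            n_chunks += 1
    return n_chunks
```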

jzhang38 commented 1 year ago

Normally you would want to finetune on Colab rather than pretrain. For that, I recommend checking out QLoRA.
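
A minimal QLoRA setup with `transformers` + `peft` + `bitsandbytes` looks roughly like the sketch below; the checkpoint name and LoRA hyperparameters are placeholders, not a tested recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Placeholder checkpoint name; substitute the TinyLlama checkpoint you want to tune.
base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit NF4 keeps a 1.1B model well within Colab VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```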

VatsaDev commented 1 year ago

Sorry, I did mean adding a dataset for SFT; I only quoted the requirement because I believed datasets would be loaded the same way.

I can see multiple different dataset options being loaded in the SFT script; can it be called the same way for any custom dataset? Also, what's the ETA for the finetuning time?
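
For example, something like the following is what I'd hope would work, swapping in a local JSONL file via HF `datasets`; the field names are just a guess at the schema the script expects:

```python
from datasets import load_dataset

# Load a local JSONL file instead of a hub dataset; each line is one example,
# e.g. {"instruction": "...", "output": "..."}. The field names the SFT script
# expects are an assumption here; check its preprocessing code.
dataset = load_dataset("json", data_files={"train": "my_sft_data.jsonl"})["train"]

def to_prompt(example):
    # Map custom fields onto a single training text; adjust to the script's format.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = dataset.map(to_prompt)
```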

jzhang38 commented 1 year ago

It took ~1 hour on 8 A40s.