VatsaDev closed 1 year ago
Normally you would want to finetune on Colab rather than pretrain. For that, I recommend checking out QLoRA.
Sorry, I did mean adding a dataset for SFT. I just quoted the requirement because I believed that datasets would be loaded the same way.
I can see multiple dataset options being loaded in the SFT script. Can it be called the same way for any custom dataset? Also, what is the ETA for finetuning?
It took ~1 hour on 8 A40 GPUs.
One of the requirements is:
Add scripts for pretraining on other datasets.
I'm assuming the pretraining dataset script would also work for finetuning, since the data is processed the same way?
I was looking through prepare_slimpajama.py, and when I tried to look into the packed dataset, I noticed it seems to use a custom dataset format. Is that right?
I think it would be very useful if you made a guide on preparing a dataset, for example with a small dataset on Colab, because most of our PCs can't handle the sheer size of the tokenized SlimPajama and StarCoder datasets.
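
To make the request concrete, here is a rough sketch of what "packing" a small dataset generally means: tokenize each document, join the token streams with a separator, and cut the result into fixed-size blocks stored in a binary file. Everything in it is an assumption for illustration only. The byte-level `tokenize`, the `EOS_ID` value, the `TOYPACK1` header, and the function names are hypothetical and are not the format or API that prepare_slimpajama.py actually uses.

```python
import struct
from array import array

EOS_ID = 256          # hypothetical separator id; byte values occupy 0..255
MAGIC = b"TOYPACK1"   # hypothetical header tag, NOT the repo's real format


def tokenize(text):
    """Stand-in byte-level tokenizer; a real run would use the model's tokenizer."""
    return list(text.encode("utf-8"))


def pack_documents(docs, block_size):
    """Concatenate tokenized docs (EOS-separated) and cut them into full blocks."""
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))
        stream.append(EOS_ID)
    n_blocks = len(stream) // block_size  # the trailing partial block is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def write_packed(path, blocks, block_size):
    """Write a small header followed by the blocks as unsigned 16-bit tokens."""
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<II", block_size, len(blocks)))
        for block in blocks:
            array("H", block).tofile(f)  # native byte order


def read_packed(path):
    """Read the file back into a list of token blocks."""
    with open(path, "rb") as f:
        if f.read(len(MAGIC)) != MAGIC:
            raise ValueError("not a toy packed file")
        block_size, n_blocks = struct.unpack("<II", f.read(8))
        blocks = []
        for _ in range(n_blocks):
            tokens = array("H")
            tokens.fromfile(f, block_size)
            blocks.append(tokens.tolist())
        return blocks
```

With something like this, a few text files can be packed with `pack_documents(docs, block_size)` and written with `write_packed`, which is small enough to try on Colab. The repo's real packed format may differ in header layout, dtype, and chunking, so this only shows the general idea.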