pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Custom dataset for llama 3 finetuning #310

Closed rshah918 closed 5 months ago

rshah918 commented 5 months ago

Any docs on how we can plug in a custom dataset for Llama 3 8B fine-tuning? Also, is q-lora supported?

lessw2020 commented 5 months ago

Hi @rshah918 - to address your questions: 1 - QLoRA: torchtitan does not support QLoRA fine-tuning. However, torchtune is all set up for that: https://github.com/pytorch/torchtune

2 - you can plug in custom datasets by extending the hf_datasets file here: https://github.com/pytorch/torchtitan/blob/f72a2a0da0bdfc394faaab9b3c0f35d0b6f5be50/torchtitan/datasets/hf_datasets.py#L20, assuming it is an HF-style dataset. We don't have any docs at the moment for adding a dataset, though that's probably a good todo item.
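To illustrate the general idea (this is a minimal standalone sketch, not torchtitan's actual hf_datasets API, and `toy_tokenizer` and `pack_sequences` are hypothetical names): an HF-style dataset for causal-LM fine-tuning is typically just an iterable of raw text samples that gets tokenized and packed into fixed-length training sequences, with labels shifted by one token.

```python
def toy_tokenizer(text):
    # Stand-in for a real tokenizer (e.g. the Llama 3 tokenizer):
    # maps each character to an integer id.
    return [ord(c) for c in text]

def pack_sequences(samples, seq_len, tokenize=toy_tokenizer):
    """Yield (input, label) pairs of length seq_len from an iterable of
    raw text samples, concatenating tokens across sample boundaries."""
    buffer = []
    for text in samples:
        buffer.extend(tokenize(text))
        # Need seq_len + 1 tokens per pair: labels are inputs shifted by one.
        while len(buffer) >= seq_len + 1:
            chunk = buffer[: seq_len + 1]
            buffer = buffer[seq_len:]
            yield chunk[:-1], chunk[1:]

# A "custom dataset" here is just any iterable of text samples; in practice
# it would come from datasets.load_dataset(...) over your own files.
corpus = ["hello world", "fine-tuning llama 3", "on a custom dataset"]
pairs = list(pack_sequences(corpus, seq_len=8))
```

The actual torchtitan code wraps this kind of logic in a class registered alongside the built-in datasets, so the main work in plugging in your own data is pointing the loader at your files and supplying the right text field.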

3 - you could use titan for full fine-tuning (all params updated) by simply adding your dataset and fine-tuning from there. However, torchtune is also set up for full fine-tuning; which is more optimal probably depends on the number of GPUs you plan to fine-tune with, where at fewer than 8 torchtune is likely the better fit.

Hope this helps!

lessw2020 commented 5 months ago

Added an issue re: better docs for adding custom datasets - https://github.com/pytorch/torchtitan/issues/311

I'm going to close this issue for now but please feel free to re-open if needed.