pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Fine-Tuning Llama Model with Large Context and Customized Dataset Using Torchtitan #677

Open Amerehei opened 5 hours ago

Amerehei commented 5 hours ago

Hi,

I am trying to fine-tune a Llama model with a large context size, and I found that to shard activations efficiently across multiple GPUs I need to use Torchtitan. Here are some questions about my setup:

See related issue: meta-llama/llama-recipes#785

  1. Custom Dataset Usage
    I created a custom dataset from parquet files plus a custom_dataset.py file that is compatible with llama-recipes, and I am using the DEFAULT_CHATML_CHAT_TEMPLATE. Could you please provide guidance on how to integrate and use this custom dataset with Torchtitan? (A sketch of how I currently load the data is included after this list.)

  2. Fine-Tuning with Pretrained Model
    Is it possible to fine-tune starting from a pretrained checkpoint? If so, what steps or configuration does Torchtitan need for this? (I've put a rough conversion sketch after this list.)

  3. Model Support (Llama-3.2-1B)
    I noticed that Torchtitan currently supports training Llama 3 models (8B, 70B) out of the box. What steps would I need to take to train meta-llama/Llama-3.2-1B specifically? (See the config sketch after this list.)

  4. Large Context and FSDP Limitation
    I am unable to use FSDP because of the large context sizes I’m working with. Any additional guidance on handling large contexts effectively with Torchtitan would be appreciated.
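
For question 1, here is roughly how I load and tokenize the parquet data today; the file paths, the `messages` column name, and the tokenizer are placeholders, and I still don't know where to plug this into Torchtitan's dataloader:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# The base tokenizer has no chat template, so I assign the ChatML template I use
# with llama-recipes before calling apply_chat_template:
# tokenizer.chat_template = DEFAULT_CHATML_CHAT_TEMPLATE

# Stream the parquet shards so the long-context corpus never has to fit in RAM.
ds = load_dataset(
    "parquet",
    data_files={"train": "data/train-*.parquet"},  # placeholder glob
    split="train",
    streaming=True,
)

def to_token_ids(example):
    # Assumes each row has a "messages" column holding the chat turns.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"input_ids": tokenizer(text, add_special_tokens=False)["input_ids"]}

tokenized = ds.map(to_token_ids)
```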
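
For question 2, I experimented with converting the Hugging Face weights into a torch.distributed.checkpoint (DCP) directory along these lines, but I have not verified the state-dict layout or parameter naming that Torchtitan's checkpoint loading actually expects:

```python
import torch
import torch.distributed.checkpoint as dcp
from transformers import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
)

# Unverified assumptions: that the weights belong under a "model" key, and that
# the HF parameter names (e.g. "model.layers.0.self_attn.q_proj.weight") must be
# remapped to the Llama reference names (e.g. "layers.0.attention.wq.weight").
state_dict = {"model": hf_model.state_dict()}

# Write a DCP checkpoint directory (this also runs in a single process).
dcp.save(state_dict, checkpoint_id="./llama-3.2-1b-dcp")
```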
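
For question 3, this is the kind of entry I would expect to register next to the existing 8B/70B configs; the module path and field names below mirror the repo layout as far as I can tell, and the hyperparameters come from the published Llama-3.2-1B config, so both should be double-checked:

```python
# e.g. in torchtitan/models/llama/__init__.py (adjust to your checkout)
llama3_configs["1B"] = ModelArgs(
    dim=2048,                # hidden size
    n_layers=16,
    n_heads=32,
    n_kv_heads=8,            # grouped-query attention
    multiple_of=256,
    ffn_dim_multiplier=1.5,  # should yield an FFN hidden size of 8192
    rope_theta=500_000,
)
```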

Thank you for your help!

aniltrkkn commented 5 hours ago

For question 3 (Llama 3.1): you need to implement `apply_scaling` on top of the Llama 3 code:

https://github.com/meta-llama/llama-models/blob/main/models/llama3/reference_impl/model.py

Note: I just realized you asked about 3.2; I read it as 3.1. But the linked repository also has the 3.2 implementations, which are trivial to port.
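
For reference, the RoPE frequency scaling in that reference implementation looks roughly like this; the exact constants (especially `scale_factor`) differ between 3.1 and 3.2, so take them from the linked repo rather than from this sketch:

```python
import math
import torch

def apply_scaling(freqs: torch.Tensor) -> torch.Tensor:
    scale_factor = 8          # Llama 3.1 value; 3.2 uses a different factor
    low_freq_factor = 1
    high_freq_factor = 4
    old_context_len = 8192    # original pretraining context length

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            new_freqs.append(freq)                  # high-frequency bands untouched
        elif wavelen > low_freq_wavelen:
            new_freqs.append(freq / scale_factor)   # low-frequency bands scaled down
        else:
            # smooth interpolation between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)
```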