This PR adds seq-to-seq finetuning (such as instruction finetuning, IFT) for LLMs. It introduces a new dataloading codepath; all major changes are within llm/src/data/finetuning/.
At a high level:
finetuning/collator.py provides a one-size-fits-all (for single-turn) seq2seq collator that turns properly formatted, tokenized examples into batches suitable for training with any of our LLM models.
finetuning/tasks.py provides a scaffold for registering functions that format and tokenize examples from different data sources.
finetuning/dataloader.py exposes build_finetuning_dataloader and interprets the dataloader config.
finetuning/convert_finetuning_dataset.py simplifies the conversion of HF datasets into an MDS format that the dataloader can build from.
finetuning/README.md explains how to use these tools.
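To make the collator's job concrete, here is a minimal sketch of single-turn seq2seq collation for a decoder-only model. It is not the code in finetuning/collator.py: the function name `collate_seq2seq`, the example keys (`input_ids` for the prompt, `labels` for the response), the pad id, and the truncation rule are all assumptions made for illustration. The only outside fact relied on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

# -100 is the default ignore_index of torch.nn.CrossEntropyLoss, so positions
# carrying this label do not contribute to the loss.
IGNORE_INDEX = -100


def collate_seq2seq(
    examples: List[Dict[str, List[int]]],
    pad_token_id: int = 0,   # assumed pad id, for illustration only
    max_seq_len: int = 8,    # assumed limit, for illustration only
) -> Dict[str, List[List[int]]]:
    """Hypothetical collator: concatenate prompt and response, mask the prompt."""
    rows = []
    for ex in examples:
        prompt, response = ex["input_ids"], ex["labels"]
        # The model sees prompt followed by response ...
        tokens = (prompt + response)[:max_seq_len]
        # ... but is only trained to predict the response tokens.
        targets = ([IGNORE_INDEX] * len(prompt) + response)[:max_seq_len]
        rows.append((tokens, targets))

    # Right-pad every row to the longest sequence in the batch.
    width = max(len(tokens) for tokens, _ in rows)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for tokens, targets in rows:
        n_pad = width - len(tokens)
        batch["input_ids"].append(tokens + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(tokens) + [0] * n_pad)
        batch["labels"].append(targets + [IGNORE_INDEX] * n_pad)
    return batch


demo_batch = collate_seq2seq([
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [1], "labels": [2]},
])
```

The nested lists would typically be converted to tensors before being handed to a model; that step is left out so the sketch has no dependencies.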
Other changes:
Various __init__.py updates to simplify importing
main.py can now call build_finetuning_dataloader
Example finetuning YAMLs in yamls/mosaic_gpt/finetuning/
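For readers unfamiliar with the registration pattern mentioned for finetuning/tasks.py, the sketch below shows one common way such a scaffold can look. Everything in it is hypothetical: `register_task`, `get_task`, the task name `toy/instruct`, the prompt template, and the character-level `toy_tokenize` stand-in are illustrative assumptions, not the names or behavior of the code in this PR.

```python
from typing import Callable, Dict, List

# A task function maps one raw example to tokenized prompt/response ids.
TokenizeFn = Callable[[Dict[str, str]], Dict[str, List[int]]]

# Hypothetical registry: task name -> formatting/tokenizing function.
_TASK_REGISTRY: Dict[str, TokenizeFn] = {}


def register_task(name: str) -> Callable[[TokenizeFn], TokenizeFn]:
    """Decorator that registers a format-and-tokenize function under `name`."""
    def decorator(fn: TokenizeFn) -> TokenizeFn:
        if name in _TASK_REGISTRY:
            raise ValueError(f"task {name!r} is already registered")
        _TASK_REGISTRY[name] = fn
        return fn
    return decorator


def get_task(name: str) -> TokenizeFn:
    """Look up a registered task, failing loudly on unknown names."""
    if name not in _TASK_REGISTRY:
        raise KeyError(f"unknown task {name!r}; known: {sorted(_TASK_REGISTRY)}")
    return _TASK_REGISTRY[name]


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (one id per character) so the sketch runs anywhere."""
    return [ord(ch) for ch in text]


@register_task("toy/instruct")
def format_toy_instruct(example: Dict[str, str]) -> Dict[str, List[int]]:
    # Assumed prompt template; a real data source would define its own.
    prompt = f"Instruction: {example['instruction']}\nResponse: "
    return {
        "input_ids": toy_tokenize(prompt),
        "labels": toy_tokenize(example["response"]),
    }


tokenized = get_task("toy/instruct")({"instruction": "hi", "response": "ok"})
```

Keeping per-source formatting behind a name lookup like this means a config only needs to carry a task name string, while the collator can stay agnostic about where the examples came from.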