instructlab / training

InstructLab Training Library - Efficient Fine-Tuning with Message-Format Data
https://pypi.org/project/instructlab-training/
Apache License 2.0

spin out a simpler trainer that could be easily understood as a standalone repository. #50

Open aldopareja opened 5 months ago

aldopareja commented 5 months ago

Redesigning the InstructLab Training Repository

We aim to redesign the InstructLab training repository with a focus on simplicity, modularity, and performance. The goal is to create a standalone, research-oriented trainer that can be used by anyone experimenting with new research directions.

Philosophy

The philosophy behind this redesign is to create a "small form factor and extremely fast (throughput-wise) trainer". This trainer should be easy to read, understand, and modify. Users might want to change how gradients are aggregated, create new samplers based on gradient statistics, support new types of large language models (LLMs), swap in new optimizers, and so on.
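As a rough illustration of the kind of modification this should enable, here is a minimal sketch (not the actual repository API; all names are hypothetical) of a training step where the loss and the gradient-aggregation rule are plain functions a researcher can swap out without touching the rest of the loop:

```python
# Hypothetical sketch: the training step takes the loss and gradient
# aggregation as plain functions, so either can be replaced independently.
# This is not the actual repository API.
import torch
import torch.nn as nn
import torch.nn.functional as F


def default_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Standard next-token cross-entropy; a custom loss can be dropped in here.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))


def default_grad_aggregate(model: nn.Module) -> None:
    # Placeholder for a custom aggregation rule, e.g. an all-reduce across
    # ranks or clipping driven by per-parameter gradient statistics.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)


def train_step(model, batch, optimizer, loss_fn=default_loss,
               grad_aggregate=default_grad_aggregate):
    logits = model(batch["input_ids"])
    loss = loss_fn(logits, batch["labels"])
    loss.backward()
    grad_aggregate(model)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


if __name__ == "__main__":
    # Tiny toy model just to show the step runs end to end.
    vocab, dim = 128, 32
    model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    batch = {
        "input_ids": torch.randint(0, vocab, (2, 16)),
        "labels": torch.randint(0, vocab, (2, 16)),
    }
    print(train_step(model, batch, optimizer))
```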

By creating a trainer that is easy to use and modify, we hope to attract more users from the research community and foster faster innovation. We aim to provide an alternative to trainers like the one from Hugging Face, which, while comprehensive, can be complex due to its support for a wide variety of data formats, sharding strategies, accelerators, and abstractions.

Structure

The fine-tuning trainer should be shallow, with a simple script-wise separation across the following areas (a minimal composition sketch follows the list):

  1. Tokenization and Data Preprocessing: This component will handle the conversion of raw data into a format suitable for model training.
  2. Model Loading and Modifications: This component will manage the loading of pre-trained models and any necessary modifications, such as overloading the loss function or changing the way embeddings are computed.
  3. Sharding Strategies, Quantization, and LoRA Wrapping: This component will handle the distribution of data and model parameters across multiple devices, as well as any necessary quantization or LoRA wrapping.
  4. Distributed Data Sampling: This component will manage the sampling of data for distributed training.
  5. Checkpointing: This component will handle the saving and loading of model checkpoints.
  6. Training Loop: This component will manage the main training loop, including forward and backward passes, optimization steps, and gradient updates.
  7. Logging and Monitoring: This component will handle the logging of training metrics and any necessary monitoring of the training process.
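To make the intended "shallow" layout concrete, here is a hedged sketch of how a thin top-level script might wire these seven components together. The module and function names are hypothetical and only illustrate the separation, not the actual repository layout:

```python
# Hypothetical sketch: each numbered component above becomes one small,
# replaceable function, and the top-level script just wires them together.
# Names are illustrative only, not the actual repository API.
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class Config:
    vocab_size: int = 128
    hidden_dim: int = 32
    steps: int = 3
    lr: float = 1e-3


def tokenize_and_preprocess(cfg: Config) -> list[dict]:
    # 1. Tokenization and data preprocessing: turn raw messages into tensors.
    return [
        {
            "input_ids": torch.randint(0, cfg.vocab_size, (2, 16)),
            "labels": torch.randint(0, cfg.vocab_size, (2, 16)),
        }
        for _ in range(cfg.steps)
    ]


def load_model(cfg: Config) -> nn.Module:
    # 2. Model loading and modifications (loss overrides, embedding changes).
    return nn.Sequential(nn.Embedding(cfg.vocab_size, cfg.hidden_dim),
                         nn.Linear(cfg.hidden_dim, cfg.vocab_size))


def shard_and_wrap(model: nn.Module) -> nn.Module:
    # 3. Sharding, quantization, and LoRA wrapping would happen here.
    return model


def sample_batches(dataset: list[dict]):
    # 4. Distributed data sampling: per-rank sampling in a real run.
    yield from dataset


def save_checkpoint(model: nn.Module, step: int) -> None:
    # 5. Checkpointing: persist model and optimizer state.
    pass


def log_metrics(step: int, loss: float) -> None:
    # 7. Logging and monitoring.
    print(f"step {step}: loss {loss:.4f}")


def train(cfg: Config) -> None:
    # 6. Training loop: forward, backward, optimizer step.
    dataset = tokenize_and_preprocess(cfg)
    model = shard_and_wrap(load_model(cfg))
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)
    for step, batch in enumerate(sample_batches(dataset)):
        logits = model(batch["input_ids"])
        loss = F.cross_entropy(logits.view(-1, cfg.vocab_size),
                               batch["labels"].view(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        log_metrics(step, loss.item())
        save_checkpoint(model, step)


if __name__ == "__main__":
    train(Config())
```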

By structuring the trainer in this way, each component can be modified (mostly) independently, making it easier for users to customize the trainer to their needs. This structure also makes the trainer easier to understand, as each component has a clear, well-defined role.

Advantages

This approach has several advantages:

russellb commented 5 months ago

Thank you for writing this up! One small question -- what would you say makes it "research-oriented" vs. more general purpose (for research and beyond)? When I hear "research-oriented", I also hear "not for production use," which I'm pretty sure isn't your intention!

aldopareja commented 5 months ago

a "stable" version of the upstream repository will be the production branch. Whatever has been battle tested and shown to work. But the upstream repo should move faster.

aldopareja commented 5 months ago

And a faster-moving training repo is necessarily research-oriented.

aldopareja commented 5 months ago

It's just moving the functions around and organizing them; no core logic will be changed, since this is as "edge" as you can get at the moment, and it's what will become the first production-rated trainer (we have tested that everything works as expected).

russellb commented 5 months ago

a "stable" version of the upstream repository will be the production branch. Whatever has been battle tested and shown to work. But the upstream repo should move faster.

OK. It sounds like you also have a git branch and release management strategy in mind? This seems important to capture somewhere.

aldopareja commented 5 months ago

Branches vs. tags vs. forks: that should be discussed, and we can decide whatever people think is better for the community.

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.