Git-Theta is a Git extension for collaborative, continual, and communal development of machine learning models.
Version control systems like Git enable large distributed teams to collaborate on shared codebases by tracking changes over time and providing tools for merging changes from multiple sources. Git-Theta is a Git extension that aims to provide similar functionality for machine learning model checkpoints by efficiently and meaningfully tracking a model's version history natively through Git. Specifically, rather than treating the checkpoint as a blob of data (as done by other systems for tracking models with Git), Git-Theta tracks changes at the level of individual parameter groups, supports parameter-efficient updates, and provides model-aware merge and diff tools.
Git-Theta is currently under active development and should be used with caution. For feature discussions and debugging help, please join the #git-theta stream in the CCCML Zulip community. If you use Git-Theta as part of a published research project, please cite our paper.
Download and install Git LFS using the instructions from the Git LFS website.
1) Install the git-theta Python package:
pip install git-theta
By default, installing git-theta with pip will not install any of the supported machine learning frameworks (PyTorch, TensorFlow, etc.).
If you want to install the framework you intend to use alongside git-theta, you can specify it when installing (e.g. by running pip install git-theta[pytorch] for PyTorch).
2) Configure Git to use Git-Theta when tracking model checkpoints:
git theta install
Say you have a codebase for training a model along with the model's checkpoint:
my_codebase
├── model.pt
└── train.py
Git-Theta allows you to use Git to track the changes to your code and your model's parameters in tandem. To use Git-Theta to track the model checkpoint, first run
git theta track model.pt
This will create or update the .gitattributes file that tells Git to use Git-Theta to handle the checkpoint file.
You can then add and commit the .gitattributes file:
git add .gitattributes
git commit
After tracking the model, you can use regular Git commands (add, commit, push, pull, checkout, status, diff, etc.) as if the checkpoint file were any other file.
To add and commit the initial version of the checkpoint, simply run
git add model.pt
git commit
Additionally, git theta add can be used instead of git add to provide optional extra information, such as the checkpoint format with --checkpoint-type, the Update used to update parameters with --update-type, and the location of auxiliary information/data for the update with --update-path.
For example, if the model was updated using LoRA, the low-rank factors can be efficiently stored by Git-Theta by running:
# After training with LoRA and saving the factors to updates.pt...
git theta add model.pt --update-type low-rank --update-path updates.pt
git commit
Git-Theta can also handle merging of models trained with differing updates.
For example, if an existing model is further trained on a new branch called alternate-training:
git checkout -b alternate-training
# After performing training...
git add model.pt
git commit
and is separately trained on the main branch:
git checkout main
# After some other training...
git add model.pt
git commit
We can then merge the updates from the alternate-training branch via a standard git merge:
git merge alternate-training
Git-Theta supports various methods for automatically merging models, including parameter averaging. The merge tool shows each parameter that differs between the two models and asks which merge operation to perform.
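As a rough illustration of what merging by parameter averaging means, the sketch below averages matching parameter groups from two checkpoints. It is not Git-Theta's merge implementation: the function name average_merge and the representation of a checkpoint as a dict mapping group names to flat lists of floats are assumptions made for this example.

```python
def average_merge(ours, theirs):
    """Merge two checkpoints (group name -> flat list of floats) by averaging.

    Hypothetical helper for illustration only; not part of Git-Theta.
    """
    merged = {}
    for name in sorted(set(ours) | set(theirs)):
        if name in ours and name in theirs:
            a, b = ours[name], theirs[name]
            if len(a) != len(b):
                raise ValueError(f"size mismatch for parameter group {name}")
            # Element-wise mean of the two versions of this group.
            merged[name] = [(x + y) / 2 for x, y in zip(a, b)]
        else:
            # The group exists on only one side: keep it unchanged.
            merged[name] = list(ours.get(name, theirs.get(name)))
    return merged


main_model = {"w": [1.0, 3.0], "b": [0.0]}
alternate_model = {"w": [3.0, 5.0], "extra": [2.0]}
merged_model = average_merge(main_model, alternate_model)
```

Groups present in both checkpoints are averaged, while groups found on only one side are carried over as-is.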
Git-Theta supports various workflows for efficiently tracking updates to a checkpoint.
Under the hood, Git-Theta tracks changes to a checkpoint at the parameter group level. A parameter group is a semantically-grouped collection of parameters like a weight matrix or bias vector in a neural network. Parameter groups are determined based on the structure of the checkpoint file itself as specified in the format-specific Checkpoint class. In the simplest case where all of the parameters of a model are updated, Git-Theta will effectively store an entirely new copy of the checkpoint. However, if only a subset of the model's parameter groups are updated, Git-Theta will only store the updates to the changed parameter groups, which saves space and communication costs. Similarly, if a model is updated by adding new parameter groups, Git-Theta will only store the new parameter groups.
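The idea of storing only changed or newly added parameter groups can be sketched as follows. This is a simplified illustration, not Git-Theta's code: the function groups_to_store is hypothetical, checkpoints are modeled as plain dicts of lists, and exact equality is used for the comparison, whereas Git-Theta compares parameter groups using locality-sensitive hashes.

```python
def groups_to_store(previous, current):
    """Return the names of parameter groups that are new or have changed.

    Hypothetical helper; uses exact equality purely for illustration.
    """
    return sorted(
        name
        for name, values in current.items()
        if name not in previous or previous[name] != values
    )


previous = {"layer1/weight": [0.1, 0.2], "layer1/bias": [0.0]}
current = {
    "layer1/weight": [0.1, 0.2],   # unchanged: nothing new to store
    "layer1/bias": [0.5],          # changed: store the new values
    "adapter/weight": [1.0],       # added: store the new group
}
to_store = groups_to_store(previous, current)
```

Only the changed bias and the new adapter group would need to be stored; the unchanged weight matrix is reused from the earlier version.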
Beyond updating a subset of a model's parameter groups, Git-Theta also natively supports parameter-efficient updates. Examples of parameter-efficient updates include updating a sparse subset of the model's parameters (as in FISH Mask or Diff Pruning) or applying a low-rank update (as in LoRA). There are multiple workflows for efficiently tracking parameter-efficient updates with Git-Theta.
A simple way to track parameter-efficient updates is to store the information required to produce the update (e.g., the low-rank factors for LoRA or the indices and values for a sparse update) as new parameter groups in the checkpoint file itself. In this case, model code handles creating and applying the update and the checkpoint is saved and loaded as usual.
Pros:
Cons:
After saving update information in the checkpoint, the new checkpoint can be committed simply using git add and git commit as usual.
A second option is to apply the updates to the parameter groups before saving them. Git-Theta will treat these updates in the same way it treats updating all parameters in a parameter group, so this approach sacrifices any savings to communication or storage costs that would have been achieved by using a parameter-efficient method.
Pros:
Cons:
After folding the updates into the parameter groups, the model can be saved, added, and committed as usual.
Another option is to save parameter-efficient update information in a separate file from the original checkpoint. This maintains storage and communication efficiency at the cost of requiring additional implementation overhead.
Pros:
Cons:
Assuming we have already committed the original model, the auxiliary information checkpoint needs to be separately added and committed as normal.
To streamline the workflow of saving update information externally, Git-Theta has functionality for applying the update as part of the version control process.
This ties together the main model checkpoint and the update checkpoint to prevent them from diverging.
In addition, Git-Theta takes care of applying the update so that the model checkpoint can be used as-is after checkout.
Git-Theta assumes that the update information checkpoint uses the same format as the original checkpoint and that the names of updates are prefixed by the name of the parameter group they are applied to.
For example, if a parameter group called /layer1/weights was updated with a low-rank update, then Git-Theta would look for parameters named /layer1/weights/R and /layer1/weights/C in the update information checkpoint based on the naming conventions in the LowRankUpdate class.
The low-rank update can then be efficiently tracked and applied with Git-Theta via
git theta add /path/to/original/checkpoint.ckpt --update-type low-rank --update-path /path/to/updates.ckpt
git commit
Note that using this approach requires using git theta add instead of just git add to allow for additional command line arguments.
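To make the naming convention concrete, the sketch below looks up the R and C factors for a parameter group and applies them to the original weights. It is an illustration under stated assumptions, not the LowRankUpdate implementation: the helper names are hypothetical, matrices are nested lists, and the update is assumed to be the product R @ C added to the original weights.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def find_factors(updates, group_name):
    """Look up the factors using the <group name>/R and <group name>/C convention."""
    return updates[f"{group_name}/R"], updates[f"{group_name}/C"]


def apply_low_rank(weight, r, c):
    """Return weight + r @ c (assumed form of the low-rank update)."""
    delta = matmul(r, c)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0], [0.0, 1.0]]
updates = {
    "/layer1/weights/R": [[1.0], [2.0]],   # shape (2, 1)
    "/layer1/weights/C": [[3.0, 4.0]],     # shape (1, 2)
}
r, c = find_factors(updates, "/layer1/weights")
updated = apply_low_rank(weight, r, c)
```

For an n by m weight matrix and rank k, the factors hold k * (n + m) values instead of n * m, which is where the storage savings come from when k is small.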
Updates that involve modifying existing parameters (rather than just completely replacing them) are referred to by Git-Theta as "incremental updates" and are handled via a plugin system (described below).
Git-Theta provides a principled and rigorous way to keep track of different versions of a model based on the standard version control workflow.
Pre-trained models are increasingly being continually updated to make them applicable to new tasks and domains. For example, a pre-trained language model might be adapted to a new objective, process text in a new domain, and improve its instruction-following capabilities before being fine-tuned on a target task. Git-Theta allows the provenance of these steps to be straightforwardly tracked using Git's built-in functionality. Apart from committing each model to keep track of a checkpoint's history, other Git functionality like tagging can be used to keep track of notable versions. When checking out a particular version of a model, Git-Theta will only download what's required to reconstruct it and won't download any files that have already been cached.
Model development is not always straightforward - often we want to try out different versions of a base model, or we might create different versions that are applicable to different tasks.
Git-Theta supports this mode of development natively through Git's branch feature: create a new branch (git checkout -b), modify the model, and add and commit it as usual.
This provides a straightforward workflow for trying out different ways to update a model.
If parameter groups are shared across checkpoints being tracked by Git-Theta (whether they are on the same or different branches), Git-Theta will only store a single copy of each parameter group.
Contributors can also develop their own updated versions of a model by forking the base repository.
If different versions of a model are created on different branches or repositories, Git-Theta will handle merging them.
When git merge is run and there is a merge conflict between two histories of a model, Git-Theta will automatically open its merge tool.
Git-Theta's merge tool currently supports basic resolution patterns like choosing the parameters from one of the models or merging parameter groups via averaging.
For more sophisticated merges, the environment variable GIT_THETA_MANUAL_MERGE can be set to true when performing the merge operation, i.e.
export GIT_THETA_MANUAL_MERGE=True
git merge ${other-branch}
and the merge tool will write out three copies of the model: one for each branch being merged and one that represents the model at the most recent commit in the history of both branches (their common ancestor). The merge tool will also specify where to save the merged model. After the merged model has been saved to the specified location, a merge commit can be created as usual.
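As one example of a custom merge that could be performed by hand on the three copies, the sketch below adds each branch's change relative to the common ancestor back onto the ancestor. This strategy is an assumption chosen for illustration and is not something Git-Theta prescribes; the function name is hypothetical, loading and saving the checkpoints is omitted, and all three models are assumed to contain the same parameter groups.

```python
def merge_with_ancestor(ancestor, ours, theirs):
    """Combine two branches by summing their differences from the ancestor.

    Hypothetical manual merge; checkpoints are dicts of flat float lists.
    """
    merged = {}
    for name, base in ancestor.items():
        merged[name] = [
            b + (o - b) + (t - b)
            for b, o, t in zip(base, ours[name], theirs[name])
        ]
    return merged


ancestor = {"w": [1.0, 1.0]}
ours = {"w": [2.0, 1.0]}     # this branch changed the first entry
theirs = {"w": [1.0, 3.0]}   # the other branch changed the second entry
merged = merge_with_ancestor(ancestor, ours, theirs)
```

Because each branch touched a different entry, both changes survive in the merged result, which would then be saved to the location the merge tool specifies.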
Git-Theta aims to support all standard Git workflows. However, there are some situations that Git-Theta does not currently support.
Currently, git rebase is not supported when special update types are used.
Additionally, repeated merge-conflict resolution, often encountered in a rebase, can be onerous for large models.
Currently, git-theta's merge utilities are optimized for (and only tested for) 3-way merges where two branches with a shared ancestor commit are merged together. We are working on support for Octopus merges where multiple branches are all combined at once.
This section describes how Git-Theta works in more detail.
Git offers several points of customization where specialized, model-aware Git-Theta versions of various tools are run. Git has a "working tree" where human-facing files live and a "staging area" where copies of working tree files live before they are stored in Git. When a file is moved from the working tree to the staging area, the "clean filter" is run. When it is moved back, the "smudge filter" is run. Git-Theta provides model-aware versions of these filters.
When a model checkpoint is cleaned (git add):
Thus, Git itself only tracks the model metadata; actual values are stored efficiently in Git LFS. Additionally, by checking for matching metadata, only changed parameters are stored.
When a model checkpoint is smudged (git checkout):
When installing Git-Theta with git theta install, the following lines are added to the global ~/.gitconfig:
[filter "theta"]
    clean = git-theta-filter clean %f
    smudge = git-theta-filter smudge %f
    required = true
[merge "theta"]
    name = Merge Models with Git-Theta
    driver = git-theta-merge %O %A %B %P
[diff "theta"]
    command = git-theta-diff
This configuration defines two Git filter drivers for Git-Theta and registers them under the name theta.
In addition, it defines merge and diff programs, also named theta.
When git theta track path/to/model is run, an entry is added to the .gitattributes file to configure Git to use Git-Theta. The new entry looks like
path/to/model filter=theta merge=theta diff=theta
This tells Git that any time a file matching the pattern path/to/model is processed, it should use the filter, merge, and diff drivers named theta.
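The effect of tracking a file on .gitattributes can be sketched in a few lines. This is only an illustration of the entry format shown above, not how git theta track is implemented; the function names track_entry and add_tracking are hypothetical, and the file contents are handled as a plain string.

```python
def track_entry(pattern):
    """Build the .gitattributes entry for a tracked checkpoint pattern."""
    return f"{pattern} filter=theta merge=theta diff=theta"


def add_tracking(gitattributes_text, pattern):
    """Return the .gitattributes contents with the entry added at most once."""
    entry = track_entry(pattern)
    lines = gitattributes_text.splitlines()
    if entry not in lines:
        lines.append(entry)
    return "\n".join(lines) + "\n"


contents = add_tracking("", "path/to/model")
contents_again = add_tracking(contents, "path/to/model")
```

Adding the same pattern twice leaves the contents unchanged, mirroring the "create or update" behavior described for the tracking command.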
Git-Theta supports updates that are based on the previous version of the parameter values.
For example, if a few entries of a parameter group are updated, Git-Theta can avoid storing a new copy of the parameter group; instead, it can be computed on the fly during a smudge filter based on the sparse update and the previous value.
Such updates are implemented as subclasses of the IncrementalUpdate class.
IncrementalUpdates include references to the commit that holds the last parameter value in their metadata.
Then, when the new value is needed, the IncrementalUpdate class will fetch the value of the previous parameter from Git and apply the current update.
This can substantially reduce storage costs, since only the update is stored rather than a full copy of the parameter group.
Additionally, this can be done recursively, i.e. Git-Theta will continuously fetch previous values and apply IncrementalUpdates until a self-contained update (such as a Dense update that replaces all parameter values with new ones) is hit.
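The recursive reconstruction described above can be sketched with a toy history. This is a simplified model rather than the IncrementalUpdate class itself: the function resolve is hypothetical, a Python list stands in for the chain of commits that Git-Theta would follow through Git, and only dense and sparse updates are modeled.

```python
def resolve(history, index):
    """Reconstruct the parameter values at history[index].

    Each record is ("dense", values) or ("sparse", {position: new_value}).
    Hypothetical stand-in for fetching previous values from Git.
    """
    kind, payload = history[index]
    if kind == "dense":
        # Self-contained update: the recursion stops here.
        return list(payload)
    if index == 0:
        raise ValueError("an incremental update needs a previous value")
    # Fetch the previous value first, then apply this update on top of it.
    values = resolve(history, index - 1)
    for position, value in payload.items():
        values[position] = value
    return values


history = [
    ("dense", [0.0, 0.0, 0.0]),   # full copy of the parameter group
    ("sparse", {1: 5.0}),         # only one entry stored
    ("sparse", {2: 7.0}),         # only one entry stored
]
latest = resolve(history, 2)
```

Each sparse record stores a single value, yet the latest version of the full parameter group can still be rebuilt by walking back to the dense record.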
To avoid processing parameter groups that have not been changed, Git-Theta needs a way to determine whether a given parameter group's values have changed.
Directly testing for equality or comparing bitwise hashes might be overly strict due to numerical instability and noise that could arise from using incremental updates, different hardware, or different software stacks.
Instead, Git-Theta uses locality-sensitive hashing (LSH) for parameter hashes.
Specifically, it uses an LSH that approximates Euclidean distance and uses the random-pool approach to hash parameters of variable sizes.
Git-Theta's LSH uses 16 hash functions and is calibrated so that two parameter groups with a Euclidean distance less than $10^{-8}$ will have the same hash with a probability of at least $0.99$.
Additionally, weights with a distance $\in [10^{-8}, 10^{-6}]$ are double-checked with numpy.allclose.
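A minimal sketch of a Euclidean LSH with a shared random pool is shown below. It only illustrates the general technique (random projections quantized into buckets) and is not Git-Theta's hashing code: the pool size, bucket width, per-hash starting offsets, and function names are all assumptions for this example, and the calibration described above is not reproduced.

```python
import math
import random

NUM_HASHES = 16        # number of hash functions, as stated above
POOL_SIZE = 4096       # hypothetical pool size
BUCKET_WIDTH = 1e-6    # hypothetical bucket width, not Git-Theta's calibration


def make_hash_family(seed=0):
    """Build a shared pool of Gaussian values and one offset per hash function."""
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(POOL_SIZE)]
    offsets = [rng.uniform(0.0, BUCKET_WIDTH) for _ in range(NUM_HASHES)]
    return pool, offsets


def lsh_signature(values, pool, offsets):
    """Hash a flat list of floats of any length into one bucket per hash function."""
    signature = []
    for h, offset in enumerate(offsets):
        # Each hash function reads the shared pool from its own starting point,
        # wrapping around so inputs longer than the pool are still covered.
        start = h * 257
        projection = sum(
            value * pool[(start + i) % len(pool)] for i, value in enumerate(values)
        )
        # Quantize the projection: nearby inputs tend to land in the same bucket.
        signature.append(math.floor((projection + offset) / BUCKET_WIDTH))
    return tuple(signature)


pool, offsets = make_hash_family(seed=0)
signature = lsh_signature([0.1, -0.2, 0.3], pool, offsets)
```

Because the projection values come from a fixed pool rather than a per-parameter matrix, the same hash family can be applied to parameter groups of any size.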
Git-Theta makes heavy use of Python plug-ins to enable users to add support for additional checkpoint formats as well as custom merge patterns and incremental updates.
Specifically, Git-Theta currently supports plug-ins for the Checkpoint, Update, and Merge classes.
Third-party users can register a plug-in by creating a small installable package that defines the plug-in and registers it as an entry point under the name scope git_theta.plugins.(checkpoints|updates|merges).
An example plugin for JSON formatted checkpoints can be found here.
Alternatively, plug-ins can be added directly to the git-theta package by adding new subclasses to the appropriate modules, then declaring them in the entry_points dict in setup.py.
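For a third-party plug-in package, the entry point declaration might look roughly like the sketch below. The group name git_theta.plugins.checkpoints comes from the name scope described above; everything else (the plug-in name my-json, the module my_json_plugin.checkpoints, the class JSONCheckpoint, and the package name in the comment) is hypothetical and only illustrates the standard setuptools entry point syntax.

```python
# Entry point groups map to lists of "name = module.path:ClassName" strings.
# All names on the right-hand side are hypothetical examples.
ENTRY_POINTS = {
    "git_theta.plugins.checkpoints": [
        "my-json = my_json_plugin.checkpoints:JSONCheckpoint",
    ],
}


def parse_entry_point(spec):
    """Split an entry point string into (name, module, attribute)."""
    name, target = (part.strip() for part in spec.split("=", 1))
    module, attribute = target.split(":", 1)
    return name, module, attribute


parsed = parse_entry_point(ENTRY_POINTS["git_theta.plugins.checkpoints"][0])

# In the plug-in's setup.py, the dict would be passed to setuptools, e.g.:
#   setup(name="my-json-plugin", entry_points=ENTRY_POINTS)
```

Plug-ins for updates or merges would be declared the same way under git_theta.plugins.updates or git_theta.plugins.merges.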
This project uses black for code formatting and isort for import statement ordering. Additionally, it includes CI that checks for compliance.
We include pre-commit hooks that will automatically run black and isort against any Python files staged for commit.
These hooks can be installed with:
$ pip install -r requirements-dev.txt
$ pre-commit install
When one of these tools must reformat your file, it will show as the pre-commit hook failing and your commit will be cancelled.
Reformatted source files will appear in your working directory ready to be re-added to staging (git add).
Running git commit -m ${msg} again will result in the hooks passing and the commit actually happening. Note: As your initial commit was blocked, you will probably want to use the same message in the commit that actually goes through.
If you use git-theta in your work, please cite:
@InProceedings{kandpal-etal-2023-git-theta,
title={Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models},
author={Kandpal, Nikhil and Lester, Brian and Muqeeth, Mohammed and Mascarenhas, Anisha and Evans, Monty and Baskaran, Vishal and Huang, Tenghao and Liu, Haokun and Raffel, Colin},
booktitle={International Conference on Machine Learning, {ICML}},
year={2023},
month={july},
url={https://arxiv.org/abs/2306.04529},
}