You could not prevent a thunderstorm, but you could use the electricity; you could not direct the wind, but you could trim your sail so as to propel your vessel as you pleased, no matter which way the wind blew.
— Cora L. V. Hatch
Levanter is a framework for training large language models (LLMs) and other foundation models that strives for legibility, scalability, and reproducibility. We built Levanter with JAX, Equinox, and Haliax.
Levanter's documentation is available at levanter.readthedocs.io. Haliax's documentation is available at haliax.readthedocs.io.
Levanter was created by Stanford's Center for Research on Foundation Models (CRFM)'s research engineering team. You can also find us in the #levanter channel on the unofficial Jax LLM Discord.
Here is a small set of examples to get you started. For more information about the various configuration options, please see the Getting Started guide or the In-Depth Configuration Guide. You can also use `--help` or poke around other configs to see all the options available to you.
After installing JAX with the appropriate configuration for your platform, you can install Levanter with:

```bash
pip install levanter
```

or using the latest version from GitHub:

```bash
git clone https://github.com/stanford-crfm/levanter.git
cd levanter
pip install -e .
wandb login  # optional, we use wandb for logging
```
If you're developing Haliax and Levanter at the same time, you can do something like:

```bash
git clone https://github.com/stanford-crfm/levanter.git
cd levanter
pip install -e .
cd ..
git clone https://github.com/stanford-crfm/haliax.git
cd haliax
pip install -e .
cd ../levanter
```
Please refer to the Installation Guide for more information on how to install Levanter.
If you're using a TPU, more complete documentation for setting that up is available here. GPU support is still in progress; documentation is available here.
As a kind of hello world, here's how you can train a GPT-2 "nano"-sized model on a small dataset.
```bash
python -m levanter.main.train_lm --config_path config/gpt2_nano.yaml

# alternatively, if you didn't use -e and are in a different directory
python -m levanter.main.train_lm --config_path gpt2_nano
```
This will train a GPT2-nano model on the WikiText-103 dataset. You can change which dataset is used by editing the `data` section of the config file.

If your dataset is a Hugging Face dataset, you can use the `data.id` field to specify it:
```bash
python -m levanter.main.train_lm --config_path config/gpt2_small.yaml --data.id openwebtext

# optionally, you may specify a tokenizer and/or a cache directory, which may be local or on gcs
python -m levanter.main.train_lm --config_path config/gpt2_small.yaml --data.id openwebtext --data.tokenizer "EleutherAI/gpt-neox-20b" --data.cache_dir "gs://path/to/cache/dir"
```
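If you prefer, the same settings can live in the config file rather than on the command line; the dotted flags mirror the nested YAML keys. Here is a minimal sketch of the `data` section, reusing the placeholder values from the commands above:

```yaml
data:
  id: openwebtext                        # Hugging Face dataset id
  tokenizer: "EleutherAI/gpt-neox-20b"   # optional: tokenizer used for preprocessing
  cache_dir: "gs://path/to/cache/dir"    # optional: local or GCS directory for the preprocessed cache
```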
If instead your data is a list of URLs, you can use the `data.train_urls` and `data.validation_urls` fields to specify them. Data URLs can be local files, GCS files, http(s) URLs, or anything else that fsspec supports. Levanter (really, fsspec) will automatically decompress `.gz` and `.zstd` files, and probably other formats too.
```bash
python -m levanter.main.train_lm --config_path config/gpt2_small.yaml --data.train_urls ["https://path/to/train/data_*.jsonl.gz"] --data.validation_urls ["https://path/to/val/data_*.jsonl.gz"]
```
You can modify the config file to change the model, the dataset, the training parameters, and more. Here's the `gpt2_small.yaml` file:
```yaml
data:
  train_urls:
    - "gs://pubmed-mosaic/openwebtext-sharded/openwebtext_train.{1..128}-of-128.jsonl.gz"
  validation_urls:
    - "gs://pubmed-mosaic/openwebtext-sharded/openwebtext_val.{1..8}-of-8.jsonl.gz"
  cache_dir: "gs://pubmed-mosaic/tokenized/openwebtext/"
model:
  gpt2:
    hidden_dim: 768
    num_heads: 12
    num_layers: 12
    seq_len: 1024
    gradient_checkpointing: true
    scale_attn_by_inverse_layer_idx: true
trainer:
  tracker:
    type: wandb
    project: "levanter"
    tags: [ "openwebtext", "gpt2" ]
  mp: p=f32,c=bfloat16
  model_axis_size: 1
  per_device_parallelism: 4
  train_batch_size: 512
optimizer:
  learning_rate: 6E-4
  weight_decay: 0.1
  min_lr_ratio: 0.1
```
Currently, we support several model architectures, including GPT-2 and Llama 1/2. We plan to add more in the future.
Here's an example of how to continue pretraining a Llama 1 or Llama 2 model on the OpenWebText dataset:
```bash
python -m levanter.main.train_lm --config_path config/llama2_7b_continued.yaml
```
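The real `config/llama2_7b_continued.yaml` ships with the repository; as a rough, unverified sketch of its shape, a continued-pretraining config points the trainer at an existing Hugging Face checkpoint. The `initialize_from_hf` and `use_hf_model_config` keys below are assumptions based on Levanter's Hugging Face checkpoint support, and the remaining values are placeholders:

```yaml
# Sketch only; see config/llama2_7b_continued.yaml in the repo for the real file.
data:
  id: openwebtext                                 # placeholder: OpenWebText via Hugging Face
model:
  llama: {}                                       # select the Llama architecture, following the `gpt2:` pattern above
initialize_from_hf: "meta-llama/Llama-2-7b-hf"    # assumed: initialize weights from a Hugging Face checkpoint
use_hf_model_config: true                         # assumed: take model hyperparameters from the checkpoint
trainer:
  train_batch_size: 1024                          # placeholder training settings
optimizer:
  learning_rate: 1.2E-5                           # placeholder learning rate for continued pretraining
```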
Please see the TPU Getting Started guide for more information on how to set up a TPU Cloud VM and run Levanter there.
Please see the CUDA Getting Started guide for more information on how to set up a CUDA environment and run Levanter there.
We welcome contributions! Please see CONTRIBUTING.md for more information.
Levanter is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.