nnScaler: Compiling DNN models for Parallel Training over Multiple Devices

What is nnScaler?

nnScaler is a parallelization engine that compiles a Deep neural network (DNN) model that designed for single-GPU execution into a program that capable of running in parallel across multiple GPUs.

System Highlights:

Ease of Use: Only a few lines of code need to be changed to enable automated parallelization.
Pythonic: The parallelization output is in PyTorch code, making it easy for users to understand and convenient for further development or customization.
Extensibility: nnScaler exposes an API to support new operators for emerging models.
Reliability: Verified through various end-to-end training sessions, nnScaler is a dependable system.
Performance: By exploring a large parallelization space, nnScaler can significantly enhance parallel training performance.

For DNN scientists, they can concentrate on model design with PyTorch on single GPU, while leaving parallelization complexities to nnScaler. It introduces innovative parallelism techniques that surpass existing methods in performance. Additionally, nnScaler supports the extension of DNN modules with new structures or execution patterns, enabling users to parallelize their custom DNN models.

For DNN system experts, they can leverage nnScaler to explore new DNN parallelization mechanisms and policies for emerging models. By providing user-defined functions for new operators not recognized by nnScaler, it ensures seamless parallelization of novel DNN models. For example, to facilitate long sequence support in LLMs.

Quick start

Installation

Prerequisite

Install the following packages before the installation of nnScaler:

Python >= 3.8, < 3.11 (3.10 is recommanded)

PyTorch >= 2.0, < 2.4 (2.2.0 is recommanded)

Install nnScaler from source

Execute below commands in nnScaler directory:

pip install -r requirements.txt
pip install -e .

Besides, to avoid cppimport error, it also needs to include nnScaler directory in environment variable PYTHONPATH:

export NNSCALER_HOME=$(pwd)
export PYTHONPATH=${NNSCALER_HOME}:$PYTHONPATH

Example Llama-3

Prerequisite for Llama-3

Install packages required to run Llama-3. Besides, a certain version of CUDA library is needed during flash-attn installation. For example, CUDA V11.8 is needed if using PyTorch 2.20.

python -m pip install transformers==4.40.0 flash-attn==2.5.5 tensorboard

Model Access

Obtain access of Llama-3 model from HuggingFace, where you will receive an access token which should be set as an environment variable:

export HF_TOKEN=<HUGGINGFACE_ACCESS_TOKEN>

Code Changes for Parallelization

You can find all the example code at examples/llama3_8B_128K. As shown below, a user needs to:

Wrap the Model: Include loss computation and other necessary components.
Configure Components: Set up the model, optimizer, and dataloader.
Initialize and Start: In the main function, create an nnScaler trainer with the above configurations and start the training process.

# import the nnScaler build-in parallelization-capable trainer
from nnscaler.cli.trainer import Trainer

# wrap model to include loss computing, etc.
class WrapperModel(torch.nn.Module):
    def __init__(self, model_id):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation='flash_attention_2')

    def forward(self, samples):
        outputs = self.model.model(
            input_ids=samples['net_input']['src_tokens'],
            use_cache=False,
            return_dict=False,
        )
        loss = torch.sum(chunk_linear_cross_entropy(outputs[0], self.model.lm_head.weight, samples['target'], ...))
        return loss, samples['ntokens'], samples['nsentences']

def main(args):
    # data config
    dataloader_config = ...

    # model config
    model_config = ModelConfig(
        type=WrapperModel,
        args={
            'model_id': args.model_id,
        },
    )
    # optimizer hyperparameters 
    optimizer_config = OptimizerConfig(
        type=MixedPrecisionAdamW,
        args={'lr': 2e-5, 'betas': (0.9, 0.95), 'weight_decay': 0.0, 'fused': True},
        #...
    )
    #...

    # setup trainer with configs of dataloader/model/optimizer, etc. 
    trainer = Trainer(train_args=TrainerArgs(
            #...
            model=model_config,
            optimizer=optimizer_config,
            dataloader=dataloader_config,
            #...
        ))
    trainer.run()

Run the example Llama-3 training

Then we can start the example, and all the parallelization tasks will be finished by nnScaler automatically.

cd examples/llama3_8B_128K

# prepare training data:
python bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096

# build the mini model
python create_mini_model.py --model_id meta-llama/Meta-Llama-3-8B-Instruct --output_id ./llama3_mini

#compile and run using data parallelism + zero1
torchrun --nproc_per_node=2 train.py --plan_ngpus 1 --runtime_ngpus 2 --name llama3_debug --model_id ./llama3_mini --dataset_path ./bookcorpus_llama3_4K

Example nanoGPT

We also provide an example to demonstrate how to parallelize a model through a PyTorch Lightning-compatible interface in nnScaler.

Find the nanoGPT example in nnScaler repo:
```
cd examples/nanogpt
```
Install nanoGPT's dependencies:
```
pip install -r requirements.txt
```

Prepare dataset:

python nanoGPT/data/shakespeare_char/prepare.py

Test with Single GPU

Now you can run train_nnscaler.py with torchrun <https://pytorch.org/docs/stable/elastic/run.html>:

torchrun --nproc_per_node=1 train_nnscaler.py nanoGPT/config/train_shakespeare_char.py

This will train a baby GPT model on a single GPU. It will take several minutes and the best validation loss will be around 1.47.

Test with Multi-GPU

By default, nnScaler parallelizes a model over GPUs with data parallelism. If you have 4 GPUs on one node:

torchrun --nproc_per_node=4 train_nnscaler.py nanoGPT/config/train_shakespeare_char.py

Or if you have multiple nodes, for example 2 nodes with 4 GPUs each:

# on each node
torchrun --nnodes=2 --nproc_per_node=4 --rdzv-id=NNSCALER_NANOGPT --rdzv-backend=c10d --rdzv-endpoint=<IP> \
    train_nnscaler.py nanoGPT/config/train_shakespeare_char.py

NOTE: The local batch size is fixed by default, so using more workers will result in a larger global batch size.

💡 For advanced usages, please stay tuned for our future release.

Success Stories

nnScaler has been adopted by multiple projects, including both product and research explorations:

(YOCO)You only cache once: Decoder-decoder architectures for language models
LongRoPE: Extending LLM context window beyond 2 million tokens
Post training for the long context version of Phi-3 series

Reference

You may find the Artifact Evaluation for OSDI'24 with the guidance here. Please cite nnScaler in your publications if it helps your research:

@inproceedings{lin2024nnscaler,
title = {nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training},
author={Lin, Zhiqi and Miao, Youshan and Zhang, Quanlu and Yang, Fan and Zhu, Yi and Li, Cheng and Maleki, Saeed and Cao, Xu and Shang, Ning and Yang, Yilei and Xu, Weijiang and Yang, Mao and Zhang, Lintao and Zhou, Lidong},
booktitle={18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
pages={347--363},
year={2024}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third-party's policies.

Contact

You may find our public repo from https://github.com/microsoft/nnscaler or microsoft internal repo https://aka.ms/ms-nnscaler. For any questions or inquiries, please contact us at nnscaler@service.microsoft.com.

microsoft / nnscaler

readme