.. image:: _static/img/AdaptDLHorizLogo.png
   :align: center

.. image:: https://img.shields.io/github/actions/workflow/status/petuum/adaptdl/test.yaml?branch=master
   :target: https://github.com/petuum/adaptdl/actions?query=workflow%3ATest
   :alt: GitHub Workflow Status

.. image:: https://codecov.io/gh/petuum/adaptdl/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/petuum/adaptdl

.. image:: https://readthedocs.org/projects/adaptdl/badge/?version=latest
   :target: https://adaptdl.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://img.shields.io/pypi/v/adaptdl
   :target: https://pypi.org/project/adaptdl/
   :alt: PyPI

`Documentation <https://adaptdl.readthedocs.org>`_ | `Examples <https://github.com/petuum/adaptdl/tree/master/examples>`_ | `CASL Project <https://www.casl-project.ai>`_

.. include-start-after

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling
framework, and is part of the `CASL open source project
<https://www.casl-project.ai>`_. The goal of AdaptDL is to make distributed DL
easy and efficient in dynamic-resource environments such as shared clusters and
the cloud.

AdaptDL consists of two components, a cluster scheduler and a training library,
which can be used together or independently of one another. Its core features,
described below, include efficient resource management, adaptive batch size
scaling, and an easy-to-use elastic API.

AdaptDL supports PyTorch training programs. TensorFlow support is coming soon!

Efficient Resource Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AdaptDL scheduler directly optimizes cluster-wide training performance and
resource utilization, by using a genetic algorithm to periodically optimize
resource allocations for all jobs. Through elastic re-scaling, co-adapting
batch sizes and learning rates, and avoiding network interference, AdaptDL
significantly accelerates shared-cluster training when compared with alternative
schedulers. For details, please see our `OSDI'21 research paper
<https://www.usenix.org/conference/osdi21/presentation/qiao>`_.

.. image:: _static/img/scheduling-performance.png
   :align: center

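For intuition only, the snippet below is a toy sketch of what a genetic-algorithm
search over GPU allocations can look like. It is not AdaptDL's scheduler: the job
names, the fitness function, and all numbers are invented for illustration, and a
real scheduler would score each candidate allocation by the training performance
it predicts for every job.

.. code-block:: python

    import random

    JOBS = ["job-a", "job-b", "job-c"]  # hypothetical jobs
    TOTAL_GPUS = 8

    def fitness(alloc):
        # Stand-in for predicted cluster-wide training performance.
        return sum(gpus ** 0.5 for gpus in alloc.values())

    def random_alloc():
        # Randomly split TOTAL_GPUS across the jobs.
        cuts = sorted(random.sample(range(TOTAL_GPUS + 1), len(JOBS) - 1))
        sizes = [b - a for a, b in zip([0] + cuts, cuts + [TOTAL_GPUS])]
        return dict(zip(JOBS, sizes))

    def mutate(alloc):
        # Move one GPU between two jobs, keeping the total fixed.
        child = dict(alloc)
        src, dst = random.sample(JOBS, 2)
        if child[src] > 0:
            child[src] -= 1
            child[dst] += 1
        return child

    population = [random_alloc() for _ in range(20)]
    for _ in range(100):  # evolve allocations over generations
        population.sort(key=fitness, reverse=True)
        survivors = population[:10]
        population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

    print(max(population, key=fitness))
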
In the cloud (e.g. AWS), AdaptDL auto-scales the size of the cluster based on how
well the cluster's resources are utilized. It automatically provisions spot
instances when available, reducing cost by up to 80%.

Adaptive Batch Size Scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Efficient distributed training requires careful selection of the batch size and
learning rate, which can be tricky to find manually. AdaptDL offers automatic
batch size and learning rate scaling, which enables efficient distributed
training without requiring manual effort. To achieve this, AdaptDL measures the
system performance and `gradient noise scale <https://arxiv.org/pdf/1812.06162.pdf>`_
during training, adaptively selects the most efficient batch size, and scales
the learning rate using `AdaScale <https://arxiv.org/pdf/2007.05105.pdf>`_.

.. image:: _static/img/autobsz-performance.png
   :align: center

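As a rough illustration of the measurement involved, the sketch below estimates
the gradient noise scale with the simple two-batch-size estimator from the linked
paper (arXiv:1812.06162). It is not AdaptDL's internal implementation; the
squared gradient norms and batch sizes are hypothetical inputs assumed to come
from your own training loop.

.. code-block:: python

    def gradient_noise_scale(g_small_sq, g_big_sq, b_small, b_big):
        """Estimate the gradient noise scale from the squared gradient norm
        measured at two batch sizes (estimator from arXiv:1812.06162)."""
        # Unbiased estimate of the true (noise-free) squared gradient norm.
        g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
        # Unbiased estimate of the per-example gradient variance (trace of covariance).
        s = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
        return s / g_sq  # the "simple" noise scale B_noise

    # Hypothetical measurements: squared gradient norms at batch sizes 128 and 1024.
    print(gradient_noise_scale(g_small_sq=2.5, g_big_sq=0.9, b_small=128, b_big=1024))
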
Easy-to-use Elastic API
^^^^^^^^^^^^^^^^^^^^^^^

Making training programs run elastically can be challenging and error-prone. AdaptDL offers APIs which make it easy to enable elasticity for data-parallel PyTorch programs. Simply change a few lines of code, without heavy refactoring!

BEFORE:

.. code-block:: python

    torch.distributed.init_process_group("nccl")
    model = torch.nn.parallel.DistributedDataParallel(model)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=128)
    for epoch in range(100):
        ...

AFTER:

.. code-block:: python

    adaptdl.torch.init_process_group("nccl")
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
    dataloader = adaptdl.torch.AdaptiveDataLoader(dataset, batch_size=128)
    for epoch in adaptdl.torch.remaining_epochs_until(100):
        ...

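Below is a slightly fuller sketch of an elastic training loop built only from the
calls shown above plus standard PyTorch. The model, optimizer, dataset, and the
``"gloo"`` backend are placeholders for illustration; substitute your own.

.. code-block:: python

    import torch
    import adaptdl.torch

    # Placeholder model, optimizer, and dataset for illustration only.
    model = torch.nn.Linear(28 * 28, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    dataset = torch.utils.data.TensorDataset(
        torch.randn(1024, 28 * 28), torch.randint(0, 10, (1024,)))

    adaptdl.torch.init_process_group("gloo")  # use "nccl" when training on GPUs
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
    dataloader = adaptdl.torch.AdaptiveDataLoader(dataset, batch_size=128)

    for epoch in adaptdl.torch.remaining_epochs_until(100):
        # When the job is re-scaled and restarted on a different number of
        # replicas, the loop resumes from the remaining epochs.
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()
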
.. include-end-before

AdaptDL consists of a Kubernetes job scheduler and an adaptive training library.
They can be used in two ways:

1. Using the scheduler to manage multiple training jobs on a shared cluster or in
   the cloud (`Scheduler Installation <https://adaptdl.readthedocs.io/en/latest/installation/index.html>`_).
2. Using the training library on its own to adapt a single training job
   (`Standalone Training <https://adaptdl.readthedocs.io/en/latest/standalone-training.html>`_).

.. image:: _static/img/Petuum.png
   :align: center