Sky Computing is a load-balanced framework for model parallelism in federated learning. It adaptively allocates model layers to devices based on their hardware specifications. Sky Computing outperforms the baseline method by 55% in training time when training a 160-layer BERT on a 64-node cluster. Our paper can be found at https://arxiv.org/abs/2202.11836
The term "sky computing" was first introduced by Dr. Katarzyna Keahey et al. to describe a cross-cloud computing pattern. Prof. Stoica and Prof. Shenker later generalized the term to geo-distributed computing. Our project builds on their definition. [1] [2]
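To make the idea of hardware-aware layer allocation concrete, here is a minimal sketch, not the allocator used by Sky Computing. It assumes each device is summarized by a single hypothetical "speed" score and splits the layers in proportion to it, using largest-remainder rounding. The function name `allocate_layers` and the speed values are illustrative only; the actual framework may weigh other factors.

```python
from typing import List


def allocate_layers(num_layers: int, device_speeds: List[float]) -> List[int]:
    """Split `num_layers` across devices in proportion to their relative speed."""
    if num_layers < 0 or not device_speeds or any(s <= 0 for s in device_speeds):
        raise ValueError("need a non-negative layer count and positive speeds")
    total = sum(device_speeds)
    # Ideal (fractional) share of layers for each device.
    ideal = [num_layers * s / total for s in device_speeds]
    # Round every share down first.
    counts = [int(x) for x in ideal]
    # Hand the leftover layers to the devices with the largest remainders.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Four hypothetical devices; the last one is twice as fast as the others.
    print(allocate_layers(160, [1.0, 1.0, 1.0, 2.0]))  # prints [32, 32, 32, 64]
```

An even split would give each of the four devices 40 layers, so the slower devices would hold up every step. The proportional split gives the faster device more work so that all devices finish at roughly the same time.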
git clone git@github.com:hpcaitech/SkyComputing.git
python -m pip install -r requirements.txt
cd ./scaelum
python -m pip install -v -e .
To benchmark Sky Computing, we provide a simple demo that you can run on your cluster to train BERT.
Bidirectional Encoder Representations from Transformers (aka BERT) is one of the state-of-the-art deep learning models for Natural Language Processing. We use BERT in our experiments to run a simple benchmark.
cd $PROJECT
mkdir -p BERT/model && cd BERT/model
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
The General Language Understanding Evaluation (aka GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Multi-Genre Natural Language Inference (aka MNLI) is one of the tasks in GLUE. It is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.
cd $PROJECT
mkdir -p BERT/data && cd BERT/data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/1502038877f6a88c225a34450793fbc3ea87eaba/download_glue_data.py
python download_glue_data.py --data_dir ./glue_data --tasks MNLI
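To illustrate what "sentence pairs annotated with textual entailment information" looks like in practice, the sketch below reads MNLI-style rows from a tab-separated file. It assumes the downloaded TSV has a header row with `sentence1`, `sentence2`, and `gold_label` columns; check the header of your copy before relying on it. The `Pair` type, the `read_mnli` helper, and the sample row are made up for illustration and are not part of this project.

```python
import csv
import io
from typing import IO, Iterator, NamedTuple


class Pair(NamedTuple):
    premise: str
    hypothesis: str
    label: str  # MNLI labels: entailment, neutral, or contradiction


def read_mnli(tsv: IO[str]) -> Iterator[Pair]:
    """Yield one Pair per data row of an MNLI-style TSV stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumed column names; adjust if your file uses different ones.
        yield Pair(row["sentence1"], row["sentence2"], row["gold_label"])


# A made-up row in the assumed layout, used instead of the real file.
SAMPLE = (
    "sentence1\tsentence2\tgold_label\n"
    "A man is playing a guitar.\tA person is making music.\tentailment\n"
)

if __name__ == "__main__":
    for pair in read_mnli(io.StringIO(SAMPLE)):
        print(pair.label, "|", pair.premise, "->", pair.hypothesis)
```

To read the real data, pass an open file handle for the training TSV under `./glue_data` instead of the `io.StringIO` sample.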
To run dllb in your cluster, you need to write a config file that contains the necessary information about training, such as the model layers and relevant environment variables. We have provided a well-commented example; the most important options are:
import os

# your project path, read from the PROJECT environment variable
PROJECT = os.getenv("PROJECT")
# allocation type; valid values are "even", "optimal" and "dynamic"
ALLOCATE_TYPE = "even"
# number of nodes (including the central server)
CORE_NUM = 4
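Because `os.getenv` returns `None` when a variable is unset, a typo in the environment can go unnoticed until training starts. The following is a small, hypothetical sanity check for the three options above; `check_config` is not part of the project, and the rule that `CORE_NUM` must be at least 2 is an assumption (one central server plus at least one other node).

```python
from typing import Optional

# The three allocation types listed in the example config.
VALID_ALLOCATE_TYPES = {"even", "optimal", "dynamic"}


def check_config(project: Optional[str], allocate_type: str, core_num: int) -> None:
    """Raise ValueError if any of the three options looks wrong."""
    # os.getenv("PROJECT") yields None when the variable is not exported.
    if not project:
        raise ValueError("PROJECT is not set; export it before launching")
    if allocate_type not in VALID_ALLOCATE_TYPES:
        raise ValueError(
            f"ALLOCATE_TYPE must be one of {sorted(VALID_ALLOCATE_TYPES)}, "
            f"got {allocate_type!r}"
        )
    # Assumption: the count includes the central server, so we require
    # at least one additional node.
    if core_num < 2:
        raise ValueError("CORE_NUM must be at least 2")


if __name__ == "__main__":
    # Literal values stand in for the PROJECT, ALLOCATE_TYPE and CORE_NUM options.
    check_config("/path/to/project", "even", 4)
    print("config looks consistent")
```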
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We used a Slurm script to run our experiments.
#!/bin/sh
#SBATCH --job-name=gpu16 # Job name
#SBATCH -o gpu16.o%j # Name of stdout output file
#SBATCH -e gpu16.e%j # Name of stderr error file
#SBATCH -N 16 # Number of nodes
#SBATCH -n 16 # Total number of tasks (here one per node)
#SBATCH --time=02:00:00 # Run time (hh:mm:ss)
# run
python ./ip_addr.py > "./HOST"
srun python ./launch.py -c "./experiment/config.py"
@misc{zhu2022sky,
title={Sky Computing: Accelerating Geo-distributed Computing in Federated Learning},
author={Jie Zhu and Shenggui Li and Yang You},
year={2022},
eprint={2202.11836},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@article{keahey2009sky,
title={Sky computing},
author={Keahey, Katarzyna and Tsugawa, Mauricio and Matsunaga, Andrea and Fortes, Jose},
journal={IEEE Internet Computing},
volume={13},
number={5},
pages={43--51},
year={2009},
publisher={IEEE}
}
@inproceedings{stoica2021cloud,
title={From cloud computing to sky computing},
author={Stoica, Ion and Shenker, Scott},
booktitle={Proceedings of the Workshop on Hot Topics in Operating Systems},
pages={26--32},
year={2021}
}