If you have any questions about vPipe, please contact sxzhao@cs.hku.hk for a quick response.
Repo architecture
runtime: contains our initial system and initial results.
cpm: a GPT-2 workload on a Chinese dataset. Under active development to make vPipe support 3-D parallelism, an NCCL backend, and dynamic scaling.
For multi-node training, make sure the nv_peer_mem driver is installed to achieve optimal communication performance.
Note that you should change the Docker base image to the NVIDIA PyTorch 20.01 release.
This avoids an issue caused by PyTorch's variable version checking.
For the Dockerfile, refer to: https://github.com/NVIDIA/DeepLearningExamples/blob/24b8c9c7fdfd1fa5b80d5c342f96dd922feffd24/PyTorch/LanguageModeling/BERT/Dockerfile
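Concretely, the change amounts to pinning the base image in that Dockerfile to the 20.01 release. A minimal sketch, assuming the NGC registry tag for that release:

```dockerfile
# Pin the base image to the NVIDIA PyTorch 20.01 release from the NGC registry
FROM nvcr.io/nvidia/pytorch:20.01-py3
```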
BERT pre-training uses the following datasets:
To download, verify, and extract the datasets, and to create the shards in .hdf5
format, see:
cd runtime
vPipe's optimal configuration for 8 GPUs
python driver.py --config_file configs/bert_8vpipe.yml
PipeDream's optimal configuration for 8 GPUs
python driver.py --config_file configs/bert_8pipedream.yml
GPipe's optimal configuration for 8 GPUs
python driver.py --config_file configs/bert_8gpipe.yml
Environment: 2 hosts, each with 4 RTX 2080 Ti GPUs
Time per epoch:
vPipe: 1.28 hours; GPipe: 1.72 hours; PipeDream: 2.14 hours
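The relative speedups follow directly from these per-epoch times (baseline time divided by vPipe's time); a small sketch using only the numbers reported above:

```python
# Per-epoch training times (hours) reported above for 8 GPUs
epoch_hours = {"vPipe": 1.28, "GPipe": 1.72, "PipeDream": 2.14}

# Speedup of vPipe over a baseline = baseline epoch time / vPipe epoch time
for system, hours in epoch_hours.items():
    if system != "vPipe":
        speedup = hours / epoch_hours["vPipe"]
        print(f"vPipe is {speedup:.2f}x faster per epoch than {system}")
# prints:
# vPipe is 1.34x faster per epoch than GPipe
# vPipe is 1.67x faster per epoch than PipeDream
```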