OK, the command is:

```bash
python -m torch.distributed.launch --nproc_per_node 2 pretrain_gpt2.py \
    --tensor-model-parallel-size 1 --pipeline-model-parallel-size 2 \
    --scatter-gather-tensors-in-pipeline \
    --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
    --seq-length 1024 --max-position-embeddings 1024 \
    --micro-batch-size 4 --global-batch-size 512 \
    --lr 0.00015 --train-iters 500000 --lr-decay-iters 320000 \
    --lr-decay-style cosine --min-lr 0.00001 --lr-warmup-fraction 0.01 \
    --data-path '/data/wikitext-103/wiki.train.tokens' \
    --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt \
    --split 949,50,1 --log-interval 1 --clip-grad 1.0 \
    --fp16 --DDP-impl local --loss-scale 16384 \
    --apply-query-key-layer-scaling --bias-gelu-fusion --bias-dropout-fusion \
    --exit-interval 320000 \
    --save './checkpoints_tmp' --save-interval 1 --load './checkpoints_tmp' \
    --pipeline-no-flushes \
    --checkpoint-activations --checkpoint-num-layers 1
```
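For reference, a quick sanity check assuming the usual Megatron-LM batch semantics: with 2 GPUs, tensor-parallel size 1 and pipeline-parallel size 2, the data-parallel size is 1, so each iteration runs global-batch-size / (micro-batch-size × data-parallel-size) = 512 / (4 × 1) = 128 micro-batches through the two pipeline stages.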
Hi, I found the 2BW code in scripts/driver_sweap.py very complex: it requires both Amazon cloud servers and frequent calls to Docker. Is there any 2BW code that can be run directly on local GPUs?