BAAI-WuDao opened this issue 2 years ago
Hi WuDao, could you provide your experiment setup? For example, the parameters used in pretrain_gpt_distributed.sh, how many machines, and how many GPUs per machine?
Hi, pretrain_gpt_distributed.sh is set up as follows:

```bash
#! /bin/bash

# Runs the "345M" parameter model
DATA_PATH='/data/wang/models/Sailing/examples/gpt2'
CHECKPOINT_PATH='./'

export WORKER_0_HOST=localhost
export WORKER_0_PORT=6000
export NUM_WORKER=1
export WORKER_RANK=0
export GPU_PER_WORKER=4

MASTER_PORT=6002
MASTER_ADDR=$WORKER_0_HOST
GPUS_PER_NODE=$GPU_PER_WORKER
NNODES=$NUM_WORKER
NODE_RANK=$WORKER_RANK
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

base_dir=$(cd `dirname $0`; pwd)
echo base_dir $base_dir

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

ds_config='{
  "train_micro_batch_size_per_gpu": 16,
  "train_batch_size": 16,
  "gradient_accumulation_steps": 2,
  "steps_per_print": 1,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true
}'

python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
  --no_python --use_env python3 \
  ${base_dir}/pretrain_gpt2.py \
  --model-parallel-size 2 \
  --num-stages 2 \
  --num-layers 24 \
  --hidden-size 1024 \
  --train-batch-size 64 \
  --gradient_accumulation_steps 16 \
  --num-attention-heads 16 \
  --batch-size 4 \
  --seq-length 1024 \
  --max-position-embeddings 1024 \
  --train-iters 500000 \
  --lr-decay-iters 450000 \
  --save $CHECKPOINT_PATH \
  --load $CHECKPOINT_PATH \
  --data-path $DATA_PATH/my-gpt2_text_document \
  --vocab-file $DATA_PATH/gpt2-vocab.json \
  --merge-file $DATA_PATH/gpt2-merges.txt \
  --data-impl mmap \
  --split 949,50,1 \
  --distributed-backend nccl \
  --lr 0.00025 \
  --lr-decay-style cosine \
  --min-lr 1.0e-5 \
  --weight-decay 1e-2 \
  --clip-grad 1.0 \
  --warmup .02 \
  --log-interval 1 \
  --save-interval 100000 \
  --vocab-size 145608 \
  --DDP-impl torch \
  --eod-mask-loss \
  --deepspeed-pipeline \
  --deepspeed \
  --config_param "$ds_config" \
  --fp16 \
  --partition_method "type:ParallelTransformerLayerPiped" \
  $@
set +x
```
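As an aside on the DeepSpeed settings above: DeepSpeed normally expects `train_batch_size` to equal `train_micro_batch_size_per_gpu * gradient_accumulation_steps * data_parallel_size`. Below is a minimal sanity check, assuming the data-parallel size here works out to 4 GPUs / (model-parallel 2 * pipeline stages 2) = 1; that division is an assumption, not something stated in the script.

```python
import json

# Only the batch-related keys of the ds_config string above.
ds_config = json.loads("""{
    "train_micro_batch_size_per_gpu": 16,
    "train_batch_size": 16,
    "gradient_accumulation_steps": 2
}""")

# Assumed data-parallel size: 4 GPUs // (model-parallel-size 2 * num-stages 2) = 1.
data_parallel_size = 4 // (2 * 2)

expected = (ds_config["train_micro_batch_size_per_gpu"]
            * ds_config["gradient_accumulation_steps"]
            * data_parallel_size)

print("train_batch_size in config:", ds_config["train_batch_size"])  # 16
print("micro * grad_accum * dp   :", expected)                       # 32
```

If those two numbers disagree, DeepSpeed will usually complain about a batch-size mismatch, so it may be worth aligning the JSON with the `--batch-size`, `--train-batch-size`, and `--gradient_accumulation_steps` flags passed on the command line.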
Hi WuDao,
Sorry, I am not able to reproduce your issue with the provided settings. I suggest the following two steps to rule out some possible causes:
First, set these environment variables to force localhost and disable UCX/RDMA:

```bash
export WORKER_0_HOST=127.0.0.1
export DMLC_NODE_HOST=127.0.0.1
export BYTEPS_WITH_UCX=0
export DMLC_ENABLE_UCX=0
export DMLC_ENABLE_RDMA=0
```
Second, rebuild your environment from this Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/pytorch:21.05-py3

RUN pip3 install boto3 regex tensorboardX==1.8 wheel pybind11 ninja psutil pyprof

RUN apt-get -yq autoremove --purge ibverbs-providers
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
    libibverbs-dev=28.0-1ubuntu1 libibverbs1=28.0-1ubuntu1
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
    cmake \
    libopenmpi-dev \
    openmpi-bin \
    openssh-client \
    openssh-server \
    ibverbs-providers \
    libibverbs-dev=28.0-1ubuntu1 \
    librdmacm-dev \
    vim \
    iputils-ping \
    llvm-10-dev \
    iproute2 \
    unzip

RUN ln -s /usr/bin/aclocal-1.16 /usr/local/bin/aclocal-1.14
RUN ln -s /usr/bin/automake /usr/local/bin/automake-1.14

ENV LD_LIBRARY_PATH "/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
ENV BYTEPS_WITH_UCX 0

RUN pip3 install https://giant-model-package.tos-cn-beijing.volces.com/byteps-0.7.2-cp38-cp38-linux_x86_64.whl

WORKDIR /root
```
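To confirm the exports from the first step are actually visible to the launched training processes, a generic check (not from the repo) is to print them from Python before training starts:

```python
import os

# Environment variables suggested above; print whatever the process actually sees.
for var in ("WORKER_0_HOST", "DMLC_NODE_HOST", "BYTEPS_WITH_UCX",
            "DMLC_ENABLE_UCX", "DMLC_ENABLE_RDMA"):
    print(var, "=", os.environ.get(var, "<unset>"))
```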
Thank you very much for your response! I followed the steps, but the bug still exists. Could you give an example of how to set the value of `mapping` at line 136 in topology.py?
Hi WuDao,
Could you paste your full log here as text rather than a screenshot, so that I can search it?
When I run `bash examples/gpt/pretrain_gpt_distributed.sh`, it reports the information and then this error. Following the error, it seems that the problem is located in topology.py, line 43, because when I print the variable self.mapping at topology.py, line 131, it is empty.
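For what it's worth, here is a minimal sketch (not the repo's actual code) of how a pipe/data/model process topology typically builds its coordinate-to-rank mapping, assuming this topology.py follows DeepSpeed's ProcessTopology convention; the axis names and the helper below are illustrative:

```python
from collections import namedtuple
from itertools import product

def build_mapping(axes, dims):
    """Enumerate the cartesian grid of coordinates and assign global ranks."""
    Coord = namedtuple("ProcessCoord", axes)
    mapping = {}
    for rank, coord in enumerate(product(*(range(d) for d in dims))):
        mapping[Coord(*coord)] = rank
    return mapping

# With the script above: --num-stages 2 (pipe), --model-parallel-size 2 (model),
# and 4 GPUs in total, so the data-parallel dimension would be 4 // (2 * 2) = 1.
mapping = build_mapping(axes=["pipe", "data", "model"], dims=[2, 1, 2])
for coord, rank in mapping.items():
    print(coord, "-> rank", rank)

# If any dim is 0 (for example a data-parallel size computed as
# world_size // (pipe * model) when fewer than 4 ranks are visible),
# the grid is empty and so is the mapping.
```

In this sketch the mapping only comes out empty when one of the dims is zero, so printing the dims (and the world size each rank actually sees) just before line 131 may help narrow this down.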