BAAI-WuDao opened this issue 2 years ago
Hi WuDao, could you provide your experiment setup? For example, the parameters used in pretrain_gpt_distributed.sh, how many machines, and how many GPUs per machine?
Hi, pretrain_gpt_distributed.sh is set up as follows:

```bash
#! /bin/bash

# Runs the "345M" parameter model
DATA_PATH='/data/wang/models/Sailing/examples/gpt2'
CHECKPOINT_PATH='./'

export WORKER_0_HOST=localhost
export WORKER_0_PORT=6000
export NUM_WORKER=1
export WORKER_RANK=0
export GPU_PER_WORKER=4

MASTER_PORT=6002
MASTER_ADDR=$WORKER_0_HOST
GPUS_PER_NODE=$GPU_PER_WORKER
NNODES=$NUM_WORKER
NODE_RANK=$WORKER_RANK
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

base_dir=$(cd `dirname $0`; pwd)
echo base_dir $base_dir

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

ds_config='{
  "train_micro_batch_size_per_gpu": 16,
  "train_batch_size": 16,
  "gradient_accumulation_steps": 2,
  "steps_per_print": 1,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true
}'

python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
  --no_python --use_env python3 \
  ${base_dir}/pretrain_gpt2.py \
  --model-parallel-size 2 \
  --num-stages 2 \
  --num-layers 24 \
  --hidden-size 1024 \
  --train-batch-size 64 \
  --gradient_accumulation_steps 16 \
  --num-attention-heads 16 \
  --batch-size 4 \
  --seq-length 1024 \
  --max-position-embeddings 1024 \
  --train-iters 500000 \
  --lr-decay-iters 450000 \
  --save $CHECKPOINT_PATH \
  --load $CHECKPOINT_PATH \
  --data-path $DATA_PATH/my-gpt2_text_document \
  --vocab-file $DATA_PATH/gpt2-vocab.json \
  --merge-file $DATA_PATH/gpt2-merges.txt \
  --data-impl mmap \
  --split 949,50,1 \
  --distributed-backend nccl \
  --lr 0.00025 \
  --lr-decay-style cosine \
  --min-lr 1.0e-5 \
  --weight-decay 1e-2 \
  --clip-grad 1.0 \
  --warmup .02 \
  --log-interval 1 \
  --save-interval 100000 \
  --vocab-size 145608 \
  --DDP-impl torch \
  --eod-mask-loss \
  --deepspeed-pipeline \
  --deepspeed \
  --config_param "$ds_config" \
  --fp16 \
  --partition_method "type:ParallelTransformerLayerPiped" \
  $@
set +x
```
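As an aside on the DeepSpeed settings above: DeepSpeed normally expects `train_batch_size` to equal `train_micro_batch_size_per_gpu * gradient_accumulation_steps * data_parallel_size`. Below is a minimal sanity check, assuming the data-parallel size here works out to 4 GPUs / (model-parallel 2 * pipeline stages 2) = 1; that division is an assumption, not something stated in the script.

```python
import json

# Only the batch-related keys of the ds_config string above.
ds_config = json.loads("""{
    "train_micro_batch_size_per_gpu": 16,
    "train_batch_size": 16,
    "gradient_accumulation_steps": 2
}""")

# Assumed data-parallel size: 4 GPUs // (model-parallel-size 2 * num-stages 2) = 1.
data_parallel_size = 4 // (2 * 2)

expected = (ds_config["train_micro_batch_size_per_gpu"]
            * ds_config["gradient_accumulation_steps"]
            * data_parallel_size)

print("train_batch_size in config:", ds_config["train_batch_size"])  # 16
print("micro * grad_accum * dp   :", expected)                       # 32
```

If those two numbers disagree, DeepSpeed will usually complain about a batch-size mismatch, so it may be worth aligning the JSON with the `--batch-size`, `--train-batch-size`, and `--gradient_accumulation_steps` flags passed on the command line.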
Hi WuDao,
Sorry, I am not able to reproduce your issue with the provided settings. I suggest the following two steps to rule out some possible causes:
First, set these environment variables to force localhost and disable UCX/RDMA:

```bash
export WORKER_0_HOST=127.0.0.1
export DMLC_NODE_HOST=127.0.0.1
export BYTEPS_WITH_UCX=0
export DMLC_ENABLE_UCX=0
export DMLC_ENABLE_RDMA=0
```
Second, rebuild your environment from this Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/pytorch:21.05-py3

RUN pip3 install boto3 regex tensorboardX==1.8 wheel pybind11 ninja psutil pyprof

RUN apt-get -yq autoremove --purge ibverbs-providers
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
    libibverbs-dev=28.0-1ubuntu1 libibverbs1=28.0-1ubuntu1
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
    cmake \
    libopenmpi-dev \
    openmpi-bin \
    openssh-client \
    openssh-server \
    ibverbs-providers \
    libibverbs-dev=28.0-1ubuntu1 \
    librdmacm-dev \
    vim \
    iputils-ping \
    llvm-10-dev \
    iproute2 \
    unzip

RUN ln -s /usr/bin/aclocal-1.16 /usr/local/bin/aclocal-1.14
RUN ln -s /usr/bin/automake /usr/local/bin/automake-1.14

ENV LD_LIBRARY_PATH "/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
ENV BYTEPS_WITH_UCX 0

RUN pip3 install https://giant-model-package.tos-cn-beijing.volces.com/byteps-0.7.2-cp38-cp38-linux_x86_64.whl

WORKDIR /root
```
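To confirm the exports from the first step are actually visible to the launched training processes, a generic check (not from the repo) is to print them from Python before training starts:

```python
import os

# Environment variables suggested above; print whatever the process actually sees.
for var in ("WORKER_0_HOST", "DMLC_NODE_HOST", "BYTEPS_WITH_UCX",
            "DMLC_ENABLE_UCX", "DMLC_ENABLE_RDMA"):
    print(var, "=", os.environ.get(var, "<unset>"))
```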
Thank you very much for your response! I followed the steps, but the bug still exists. Could you give an example of how to set the value of `mapping` at line 136 in topology.py?
Hi WuDao,
Could you paste your full log here as text rather than a screenshot, so that I can search it?
When I run `bash examples/gpt/pretrain_gpt_distributed.sh`, it reports the information and then this error. Following the error, it seems that the problem is located in topology.py, line 43, because when I print the variable self.mapping at topology.py, line 131, it is empty.
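For what it's worth, here is a minimal sketch (not the repo's actual code) of how a pipe/data/model process topology typically builds its coordinate-to-rank mapping, assuming this topology.py follows DeepSpeed's ProcessTopology convention; the axis names and the helper below are illustrative:

```python
from collections import namedtuple
from itertools import product

def build_mapping(axes, dims):
    """Enumerate the cartesian grid of coordinates and assign global ranks."""
    Coord = namedtuple("ProcessCoord", axes)
    mapping = {}
    for rank, coord in enumerate(product(*(range(d) for d in dims))):
        mapping[Coord(*coord)] = rank
    return mapping

# With the script above: --num-stages 2 (pipe), --model-parallel-size 2 (model),
# and 4 GPUs in total, so the data-parallel dimension would be 4 // (2 * 2) = 1.
mapping = build_mapping(axes=["pipe", "data", "model"], dims=[2, 1, 2])
for coord, rank in mapping.items():
    print(coord, "-> rank", rank)

# If any dim is 0 (for example a data-parallel size computed as
# world_size // (pipe * model) when fewer than 4 ranks are visible),
# the grid is empty and so is the mapping.
```

In this sketch the mapping only comes out empty when one of the dims is zero, so printing the dims (and the world size each rank actually sees) just before line 131 may help narrow this down.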