microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost #936

Open lovedoubledan opened 6 days ago

lovedoubledan commented 6 days ago

When I use the default command, DeepSpeed appears to use 29500 as the master_port. However, the port seems unchangeable, even when I pass "--master_port 29501" or set it via "deepspeed.init_distributed(dist_backend='nccl', distributed_port=config.master_port)".

Error message:

[W1120 21:36:50.764587163 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 127.0.0.1:29500 - retrying (try=3, timeout=1800000ms, delay=1496ms): Connection reset by peer
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc06bba0446 in /data/wujiahao/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/lib/libc10.so)
...
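A "Connection reset by peer" on 127.0.0.1:29500 can have several causes. One that is cheap to rule out is another process already holding the port. The following is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper name, and the candidate ports are simply the ones mentioned in this thread.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if `port` can currently be bound on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # bind() raises OSError when the address is already in use.
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Ports taken from this thread; adjust as needed.
    for candidate in (29500, 29501, 20815):
        status = "free" if port_is_free(candidate) else "in use"
        print(f"{candidate}: {status}")
```

This only reports the state of the port at the moment of the check, so it is a diagnostic aid rather than a guarantee.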

lovedoubledan commented 6 days ago

I found that master_port is specified in constants.py as 29500 and cannot be changed through any interface. I hope this bug can be fixed.

tjruwase commented 5 days ago

@lovedoubledan, can you share your full command-line to show the example code?

lovedoubledan commented 5 days ago

My command line is:

deepspeed --include localhost:4,7 train_stage2.py \
    --config_file config/gptir3_notokenloss_plus.yaml \
    --deepspeed --deepspeed_config config/deepspeed_config/gptir.json --master_port 20815

And my code looks like this:

import argparse
import deepspeed

parser = argparse.ArgumentParser()
# Input Parameters
parser.add_argument('--config_file', type=str, default="config/gptir3_notokenloss_plus.yaml")
parser.add_argument("--local_rank",
                    type=int,
                    default=-1,
                    help="local_rank for distributed training on gpus")
parser.add_argument("--master_port",
                    type=int,
                    default=20815)
parser = deepspeed.add_config_arguments(parser)
# parser.add_argument('--deepspeed_config', type=str, default="config/deepspeed_config/gptir.json")
config = parser.parse_args()

...

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=config,
    model=net,
    model_parameters=net.configure_parameters(),
    distributed_port=config.master_port)
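A note on argument placement, offered as a sketch rather than a confirmed diagnosis: launcher-style CLIs commonly treat everything after the script name as arguments for the script itself, so a `--master_port` placed after `train_stage2.py` may never reach the launcher. The toy parser below uses plain `argparse` to show the difference between the two placements. `toy_launcher` and `build_launcher_parser` are hypothetical names, and this is not DeepSpeed's actual parser.

```python
import argparse


def build_launcher_parser():
    # Toy stand-in for a launcher CLI (assumption: not DeepSpeed's real parser).
    parser = argparse.ArgumentParser(prog="toy_launcher")
    parser.add_argument("--include", type=str, default="")
    parser.add_argument("--master_port", type=int, default=29500)
    # The first positional is the user script; everything after it is
    # collected verbatim and would be forwarded to that script.
    parser.add_argument("user_script", type=str)
    parser.add_argument("user_args", nargs=argparse.REMAINDER)
    return parser


parser = build_launcher_parser()

# --master_port AFTER the script name: it lands in user_args.
after = parser.parse_args(
    ["--include", "localhost:4,7", "train_stage2.py",
     "--deepspeed", "--master_port", "20815"])

# --master_port BEFORE the script name: the launcher itself sees it.
before = parser.parse_args(
    ["--master_port", "20815", "--include", "localhost:4,7",
     "train_stage2.py", "--deepspeed"])

print("after :", after.master_port, after.user_args)
print("before:", before.master_port, before.user_args)
```

If the DeepSpeed launcher behaves the same way, moving `--master_port 20815` in front of `train_stage2.py` would be the thing to try.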

loadams commented 21 hours ago

Hi @lovedoubledan - can you share a repro code snippet with us? Also do you see any warnings printed about the port? And could you try setting the master port in the ds_config as well to see if that works?
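To help answer the question about which port is actually in effect, one option is to print the rendezvous environment variables at the top of the training script. `torch.distributed`'s `env://` initialization reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE`, and launchers typically export them along with `LOCAL_RANK` before starting each process. The sketch below assumes that setup; `rendezvous_env` is a hypothetical helper name.

```python
import os

# Environment variables commonly set by distributed launchers.
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE")


def rendezvous_env(environ=None):
    """Return the rendezvous-related variables, marking missing ones."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<unset>") for name in RENDEZVOUS_VARS}


# Print once per process, before any distributed initialization runs.
print(rendezvous_env())
```

If `MASTER_PORT` still shows 29500 here, the value passed on the command line did not reach the launcher.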