pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
https://pytorch.org/examples
BSD 3-Clause "New" or "Revised" License

Multinode.py example fails #1279

Open rohan-mehta-1024 opened 2 months ago

rohan-mehta-1024 commented 2 months ago

I am using the code from the multinode.py file (from this DDP tutorial series: https://www.youtube.com/watch?v=KaAJtI1T2x4) with the following Slurm script:

```
#SBATCH -N 2
#SBATCH --gres=gpu:volta:1
#SBATCH -c 10

source /etc/profile.d/modules.sh

module load anaconda/2023a
module load cuda/11.6
module load nccl/2.11.4-cuda11.6

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO

srun torchrun \
--nnodes 2 \
--nproc_per_node 1 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29503 \
multi_tutorial.py 50 10
```

However, it gives the following error:

```
Node IP: 172.31.130.84
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /home/gridsan/rmehta/potential_function/multi_tutorial.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 1
  run_id           : 7644
  rdzv_backend     : c10d
  rdzv_endpoint    : 172.31.130.84:29503
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /home/gridsan/rmehta/potential_function/multi_tutorial.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 1
  run_id           : 7644
  rdzv_backend     : c10d
  rdzv_endpoint    : 172.31.130.84:29503
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /state/partition1/slurm_tmp/26645921.1.1/torchelastic_00qy3rwa/7644__t3nqnre
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:426] [c10d] The server socket has failed to listen on [::]:29503 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29503 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /state/partition1/slurm_tmp/26645921.1.0/torchelastic_sxs5m6o3/7644_v74ll5_1
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=d-9-11-1.supercloud.mit.edu
  master_port=58139
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=d-9-11-1.supercloud.mit.edu
  master_port=58139
  group_rank=1
  group_world_size=2
  local_ranks=[0]
  role_ranks=[1]
  global_ranks=[1]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /state/partition1/slurm_tmp/26645921.1.1/torchelastic_00qy3rwa/7644__t3nqnre/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /state/partition1/slurm_tmp/26645921.1.0/torchelastic_sxs5m6o3/7644_v74ll5_1/attempt_0/0/error.json
d-9-11-1:2870757:2870757 [0] NCCL INFO Bootstrap : Using ens2f0:172.31.130.84<0>
d-9-11-1:2870757:2870757 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d-9-11-1:2870757:2870757 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.14.3+cuda11.6
d-9-11-1:2870756:2870756 [0] NCCL INFO cudaDriverVersion 12020
d-9-11-1:2870756:2870756 [0] NCCL INFO Bootstrap : Using ens2f0:172.31.130.84<0>
d-9-11-1:2870756:2870756 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d-9-11-1:2870756:2870934 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB ens2f0:172.31.130.84<0>
d-9-11-1:2870756:2870934 [0] NCCL INFO Using network IB
d-9-11-1:2870757:2870933 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB ens2f0:172.31.130.84<0>
d-9-11-1:2870757:2870933 [0] NCCL INFO Using network IB

d-9-11-1:2870757:2870933 [0] init.cc:525 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 86000
d-9-11-1:2870757:2870933 [0] NCCL INFO init.cc:1089 -> 5
d-9-11-1:2870757:2870933 [0] NCCL INFO group.cc:64 -> 5 [Async thread]

d-9-11-1:2870756:2870934 [0] init.cc:525 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 86000
d-9-11-1:2870756:2870934 [0] NCCL INFO init.cc:1089 -> 5
d-9-11-1:2870756:2870934 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d-9-11-1:2870756:2870756 [0] NCCL INFO group.cc:421 -> 3
d-9-11-1:2870756:2870756 [0] NCCL INFO group.cc:106 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO group.cc:421 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO group.cc:106 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO comm 0x560c6a8aafd0 rank 0 nranks 2 cudaDev 0 busId 86000 - Abort COMPLETE
d-9-11-1:2870756:2870756 [0] NCCL INFO comm 0x55592676f080 rank 1 nranks 2 cudaDev 0 busId 86000 - Abort COMPLETE
Traceback (most recent call last):
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 113, in <module>
Traceback (most recent call last):
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 113, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 100, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 39, in __init__
    self.model = DDP(self.model, device_ids=[self.local_rank])
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 100, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 39, in __init__
    self.model = DDP(self.model, device_ids=[self.local_rank])
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 86000
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 86000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2870757) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/python3.9
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2870756) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/python3.9
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00014710426330566406 seconds
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004813671112060547 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 0 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/torchrun", line 8, in <module>
Traceback (most recent call last):
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    sys.exit(main())
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    return f(*args, **kwargs)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
    run(args)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    elastic_launch(
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/gridsan/rmehta/potential_function/multi_tutorial.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-24_11:39:43
  host      : d-9-11-1.supercloud.mit.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2870757)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/gridsan/rmehta/potential_function/multi_tutorial.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-24_11:39:43
  host      : d-9-11-1.supercloud.mit.edu
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 2870756)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: d-9-11-1: tasks 0-1: Exited with exit code 1
```

I am unsure whether the root problem is the failure to bind the rendezvous port, with the two processes landing on the same GPU as a downstream consequence, or whether these are two separate errors. I have tried many different ports, but they all give the same bind/listen failure. Again, my code is identical to the multinode.py example. I would appreciate any help getting to the bottom of this. Thank you.
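One check I plan to try, in case it helps narrow things down: confirming whether the plain `srun` call actually places the two torchrun agents on different nodes. A minimal placement check under the same allocation (just a sketch using standard Slurm variables, not verified on SuperCloud):

```
# Run one shell per Slurm task, exactly like the `srun torchrun` line above,
# and print where each task lands and which GPU it can see. If both lines
# report the same hostname, both agents share one node/GPU, which would tie
# the port clash and the "Duplicate GPU detected" error together.
srun bash -c 'echo "task ${SLURM_PROCID}: host=$(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'
```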

msaroufim commented 1 month ago

@subramen

subramen commented 1 month ago

Not a Slurm expert, but it looks like the two errors are related to incorrect resource allocation by Slurm. I see a similar issue on the forum; can you check whether this resolves yours? https://discuss.pytorch.org/t/error-with-ddp-on-multiple-nodes/195251/4
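If the root cause is indeed Slurm packing both tasks onto the head node (your log shows both torchrun agents and both ranks on d-9-11-1), the usual fix is to request exactly one task per node. A rough, untested sketch of how your script could look (adding `--ntasks-per-node` is my assumption; adapt as needed for SuperCloud):

```
# Two nodes, exactly one launcher (torchrun agent) per node,
# one GPU per node and 10 CPUs per task, as in the original script.
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:volta:1
#SBATCH -c 10

# ... module loads and the head_node_ip lookup stay the same ...

# Make the layout explicit on the srun line too, so each node runs a
# single torchrun agent that manages its local GPU.
srun --ntasks=2 --ntasks-per-node=1 torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29503 \
    multi_tutorial.py 50 10
```

Having only one agent on the head node should also make the "Address already in use" warning go away, since only that agent would try to bind the rendezvous port 29503.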