Open vipulSharma18 opened 2 weeks ago
[BugReport] submitit_local won't work for more than 1 GPU :confused: .
```
[2024-11-06 13:33:45,945][root][INFO] - Exporting PyTorch distributed environment variables.
[2024-11-06 13:33:45,945][root][INFO] - SLURM detected!
[2024-11-06 13:33:45,969][root][INFO] - MASTER_ADDR: gpu2103
[2024-11-06 13:33:45,969][root][INFO] - MASTER_PORT: 25425
[2024-11-06 13:33:45,970][root][INFO] - Process group: 2 tasks
[2024-11-06 13:33:45,970][root][INFO] - rank: 0
[2024-11-06 13:33:45,970][root][INFO] - world size: 2
[2024-11-06 13:33:45,970][root][INFO] - local rank: 0
[2024-11-06 13:33:45,971][root][ERROR] - Error setting up distributed hardware. Falling back to default GPU configuration.
Traceback (most recent call last):
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/stable_ssl/base.py", line 499, in _set_device
    self.config.hardware = setup_distributed(self.config.hardware)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/stable_ssl/utils/utils.py", line 101, in setup_distributed
    torch.distributed.init_process_group(
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/rendezvous.py", line 221, in _tcp_rendezvous_handler
    store = _create_c10d_store(
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. port: 25425, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[2024-11-06 13:33:46,033][root][INFO] - GPU info (nvidia-smi):
[2024-11-06 13:33:46,033][root][INFO] - NVIDIA GeForce RTX 3090, 24576 MiB, P2, 4, GPU-5d8fa3d4-f78d-6100-dbbc-4382e6fe64ae, 00000000:25:00.0
NVIDIA GeForce RTX 3090, 24576 MiB, P8, 4, GPU-62c84bb3-c568-5485-7888-a55548254a6f, 00000000:A1:00.0
```
Opinion: if the user launches through submitit, we should use its job environment to derive the torch.distributed configuration. Overwriting those values with SLURM or the local system's parameters leads to bugs and unexpected behavior, such as the EADDRINUSE failure above when two tasks race for the same port.
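A minimal sketch of what I mean. `submitit.JobEnvironment` already exposes the rendezvous info (rank, world size, hostnames, job id), so the launcher never has to re-read SLURM variables. The `port_from_job_id` helper and `setup_from_submitit` are hypothetical names, not `stable_ssl` API; deriving the port from the job id is one possible way to avoid the address-already-in-use collision on a shared node:

```python
import hashlib


def port_from_job_id(job_id: str, lo: int = 20000, hi: int = 60000) -> int:
    """Map a job id to a deterministic port in [lo, hi).

    Every task of the same job computes the same port, while concurrent
    jobs on one node almost certainly get different ports, avoiding the
    EADDRINUSE failure seen in the log above.
    """
    digest = int(hashlib.sha1(job_id.encode()).hexdigest(), 16)
    return lo + digest % (hi - lo)


def setup_from_submitit(backend: str = "nccl") -> None:
    """Initialize torch.distributed purely from submitit's job environment."""
    import torch.distributed as dist
    from submitit import JobEnvironment  # real submitit API

    env = JobEnvironment()
    dist.init_process_group(
        backend=backend,
        # First hostname in the allocation acts as the rendezvous master.
        init_method=f"tcp://{env.hostnames[0]}:{port_from_job_id(env.job_id)}",
        rank=env.global_rank,
        world_size=env.num_tasks,
    )
```

This is only a sketch of the direction, not a drop-in patch for `setup_distributed`; the point is that rank, world size, and master address all come from `JobEnvironment` rather than from re-parsing the environment.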