rbalestr-lab / stable-SSL

https://rbalestr-lab.github.io/stable-SSL.github.io/dev/
MIT License

[BugReport] submitit_local won't work with more than 1 GPU #71

Open vipulSharma18 opened 2 weeks ago

vipulSharma18 commented 2 weeks ago

[BugReport] submitit_local won't work with more than 1 GPU :confused:

[2024-11-06 13:33:45,945][root][INFO] - Exporting PyTorch distributed environment variables.
[2024-11-06 13:33:45,945][root][INFO] - SLURM detected!
[2024-11-06 13:33:45,969][root][INFO] - MASTER_ADDR:
    gpu2103
[2024-11-06 13:33:45,969][root][INFO] - MASTER_PORT:
    25425
[2024-11-06 13:33:45,970][root][INFO] - Process group:
    2 tasks
[2024-11-06 13:33:45,970][root][INFO] -     rank: 0
[2024-11-06 13:33:45,970][root][INFO] -     world size: 2
[2024-11-06 13:33:45,970][root][INFO] -     local rank: 0
[2024-11-06 13:33:45,971][root][ERROR] - Error setting up distributed hardware. Falling back to default GPU configuration.
Traceback (most recent call last):
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/stable_ssl/base.py", line 499, in _set_device
    self.config.hardware = setup_distributed(self.config.hardware)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/stable_ssl/utils/utils.py", line 101, in setup_distributed
    torch.distributed.init_process_group(
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/rendezvous.py", line 221, in _tcp_rendezvous_handler
    store = _create_c10d_store(
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. port: 25425, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[2024-11-06 13:33:46,033][root][INFO] - GPU info (nvidia-smi):
[2024-11-06 13:33:46,033][root][INFO] -     NVIDIA GeForce RTX 3090, 24576 MiB, P2, 4, GPU-5d8fa3d4-f78d-6100-dbbc-4382e6fe64ae, 00000000:25:00.0
NVIDIA GeForce RTX 3090, 24576 MiB, P8, 4, GPU-62c84bb3-c568-5485-7888-a55548254a6f, 00000000:A1:00.0
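For context on the failure: error code -98 (EADDRINUSE) from TCPStore means a second process tried to bind a port that another process already holds. Presumably both local tasks resolve rank 0 from the same SLURM-style variables, so both try to host the rendezvous server on port 25425 and the second bind fails. A purely illustrative check (not stable-SSL code) to confirm the port is already held:

```python
# Illustrative only: try to bind the rendezvous port; failure to bind is
# exactly the EADDRINUSE condition that TCPStore reports above.
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if `port` can still be bound on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:  # errno 98 (EADDRINUSE) on Linux
            return False


print(port_is_free(25425))  # False while the first task's TCPStore is listening
```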
vipulSharma18 commented 2 weeks ago

Opinion: if the user launches through submitit, we should use its job environment to derive the torch distributed configuration. Trying to overwrite it with SLURM or local-system parameters leads to bugs and unexpected behavior like the above.
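A rough sketch of what that could look like, built on `submitit.JobEnvironment` (which submitit populates for both its SLURM and local executors). The function name and default port here are my own, not stable-SSL's actual `setup_distributed`:

```python
# Hedged sketch, not stable-SSL's actual code: let submitit describe the
# job instead of re-deriving rank/world size from SLURM or the local machine.
import os

import submitit
import torch.distributed as dist


def init_distributed_from_submitit(port: int = 25425) -> None:
    # JobEnvironment auto-detects whichever executor launched this task.
    job_env = submitit.JobEnvironment()
    os.environ["MASTER_ADDR"] = job_env.hostnames[0]
    os.environ["MASTER_PORT"] = str(port)
    os.environ["RANK"] = str(job_env.global_rank)
    os.environ["WORLD_SIZE"] = str(job_env.num_tasks)
    os.environ["LOCAL_RANK"] = str(job_env.local_rank)
    # env:// makes torch read the variables exported above, so only one
    # task (global rank 0) ends up hosting the TCPStore server.
    dist.init_process_group(backend="nccl", init_method="env://")
```

If I recall correctly, recent submitit versions also ship `submitit.helpers.TorchDistributedEnvironment`, which performs essentially this export in one call.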