rbalestr-lab / stable-SSL

https://rbalestr-lab.github.io/stable-SSL.github.io/dev/
MIT License

[BugReport] submitit_local won't work with more than 1 GPU #71

Open vipulSharma18 opened 2 weeks ago

vipulSharma18 commented 2 weeks ago

[BugReport] submitit_local won't work with more than 1 GPU :confused:

[2024-11-06 13:33:45,945][root][INFO] - Exporting PyTorch distributed environment variables.
[2024-11-06 13:33:45,945][root][INFO] - SLURM detected!
[2024-11-06 13:33:45,969][root][INFO] - MASTER_ADDR:
    gpu2103
[2024-11-06 13:33:45,969][root][INFO] - MASTER_PORT:
    25425
[2024-11-06 13:33:45,970][root][INFO] - Process group:
    2 tasks
[2024-11-06 13:33:45,970][root][INFO] -     rank: 0
[2024-11-06 13:33:45,970][root][INFO] -     world size: 2
[2024-11-06 13:33:45,970][root][INFO] -     local rank: 0
[2024-11-06 13:33:45,971][root][ERROR] - Error setting up distributed hardware. Falling back to default GPU configuration.
Traceback (most recent call last):
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/stable_ssl/base.py", line 499, in _set_device
    self.config.hardware = setup_distributed(self.config.hardware)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/stable_ssl/utils/utils.py", line 101, in setup_distributed
    torch.distributed.init_process_group(
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/rendezvous.py", line 221, in _tcp_rendezvous_handler
    store = _create_c10d_store(
  File "/oscar/home/vsharm44/projects/ssl-perturbation-augmentation/.venv/lib64/python3.9/site-packages/torch/distributed/rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. port: 25425, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[2024-11-06 13:33:46,033][root][INFO] - GPU info (nvidia-smi):
[2024-11-06 13:33:46,033][root][INFO] -     NVIDIA GeForce RTX 3090, 24576 MiB, P2, 4, GPU-5d8fa3d4-f78d-6100-dbbc-4382e6fe64ae, 00000000:25:00.0
NVIDIA GeForce RTX 3090, 24576 MiB, P8, 4, GPU-62c84bb3-c568-5485-7888-a55548254a6f, 00000000:A1:00.0
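For context on the failure: error code -98 (EADDRINUSE) from TCPStore means a second process tried to bind a port that another process already holds. Presumably both local tasks resolve rank 0 from the same SLURM-style variables, so both try to host the rendezvous server on port 25425 and the second bind fails. A purely illustrative check (not stable-SSL code) to confirm the port is already held:

```python
# Illustrative only: try to bind the rendezvous port; failure to bind is
# exactly the EADDRINUSE condition that TCPStore reports above.
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if `port` can still be bound on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:  # errno 98 (EADDRINUSE) on Linux
            return False


print(port_is_free(25425))  # False while the first task's TCPStore is listening
```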
vipulSharma18 commented 2 weeks ago

Opinion: if the user launches through submitit, we should use its job environment to derive the torch distributed configuration. Trying to overwrite it with SLURM or local-system parameters leads to bugs and unexpected behavior like the above.
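A rough sketch of what that could look like, built on `submitit.JobEnvironment` (which submitit populates for both its SLURM and local executors). The function name and default port here are my own, not stable-SSL's actual `setup_distributed`:

```python
# Hedged sketch, not stable-SSL's actual code: let submitit describe the
# job instead of re-deriving rank/world size from SLURM or the local machine.
import os

import submitit
import torch.distributed as dist


def init_distributed_from_submitit(port: int = 25425) -> None:
    # JobEnvironment auto-detects whichever executor launched this task.
    job_env = submitit.JobEnvironment()
    os.environ["MASTER_ADDR"] = job_env.hostnames[0]
    os.environ["MASTER_PORT"] = str(port)
    os.environ["RANK"] = str(job_env.global_rank)
    os.environ["WORLD_SIZE"] = str(job_env.num_tasks)
    os.environ["LOCAL_RANK"] = str(job_env.local_rank)
    # env:// makes torch read the variables exported above, so only one
    # task (global rank 0) ends up hosting the TCPStore server.
    dist.init_process_group(backend="nccl", init_method="env://")
```

If I recall correctly, recent submitit versions also ship `submitit.helpers.TorchDistributedEnvironment`, which performs essentially this export in one call.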