hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

Allow running multiple jobs on the same node #159

Closed s-kyungyong closed 1 year ago

s-kyungyong commented 1 year ago

Hi!

It looks like this error appears when two FastFold jobs are running on the same node. Is it possible to allow FastFold to find another available port to avoid this issue? (One possible approach is sketched after the traceback below.)

WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
running in multimer mode...
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:18417 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:18417 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 548, in <module>
    main(args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 164, in main
    inference_multimer_model(args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 293, in inference_multimer_model
    torch.multiprocessing.spawn(inference_model, nprocs=args.gpus, args=(args.gpus, result_q, batch, args))
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 127, in inference_model
    fastfold.distributed.init_dap()
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/distributed/core.py", line 39, in init_dap
    colossalai.launch_from_torch(
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py", line 219, in launch_from_torch
    launch(config=config,
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py", line 99, in launch
    gpc.init_global_dist(rank, world_size, backend, host, port)
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 374, in init_global_dist
    dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 212, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:18417 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:18417 (errno: 98 - Address already in use).
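
For reference, a minimal sketch of what "find another available port" could look like, assuming the launcher reads MASTER_PORT from the environment (as colossalai.launch_from_torch does). find_free_port is a hypothetical helper, not part of FastFold:

import os
import socket

def find_free_port() -> int:
    # Ask the OS for an unused TCP port by binding to port 0.
    # Note: the port could in principle be taken by another process
    # between this call and the actual rendezvous.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Export the port before distributed initialization so each job on the
# node rendezvouses on its own port instead of a shared hard-coded one.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = str(find_free_port())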
s-kyungyong commented 1 year ago

This seems to be related to this issue: https://github.com/pytorch/pytorch/issues/73320

I believe running torchrun --rdzv_backend c10d inference.py instead of python inference.py resolves this issue.
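
For example (the port numbers and --nproc_per_node value below are illustrative, not tested with FastFold). Giving each job its own rendezvous endpoint means two launches on the same node no longer fight over one hard-coded port:

# job 1
torchrun --rdzv_backend c10d --rdzv_endpoint localhost:29500 --nnodes 1 --nproc_per_node 2 inference.py <usual arguments>
# job 2
torchrun --rdzv_backend c10d --rdzv_endpoint localhost:29501 --nnodes 1 --nproc_per_node 2 inference.py <usual arguments>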