This error appears when two FastFold instances run on the same node: the second one fails because port 18417 is already in use. Would it be possible for FastFold to pick another available port automatically to avoid this?
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
running in multimer mode...
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:18417 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:18417 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 548, in <module>
main(args)
File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 164, in main
inference_multimer_model(args)
File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 293, in inference_multimer_model
torch.multiprocessing.spawn(inference_model, nprocs=args.gpus, args=(args.gpus, result_q, batch, args))
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 127, in inference_model
fastfold.distributed.init_dap()
File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/distributed/core.py", line 39, in init_dap
colossalai.launch_from_torch(
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py", line 219, in launch_from_torch
launch(config=config,
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py", line 99, in launch
gpc.init_global_dist(rank, world_size, backend, host, port)
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 374, in init_global_dist
dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 212, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:18417 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:18417 (errno: 98 - Address already in use).
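For illustration, here is a minimal sketch of the kind of port selection I have in mind. It asks the OS for an unused port by binding to port 0, then exports it as `MASTER_PORT` in the parent process before any workers are spawned. The helper name `find_free_port` is hypothetical, and I am assuming the launcher would read `MASTER_PORT` from the environment (the usual `torch.distributed` convention) instead of a fixed port. I have not checked whether FastFold currently honors a preset value.

```python
import os
import socket


def find_free_port(host: str = "") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0.

    Hypothetical helper, not part of FastFold or ColossalAI.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


# Choose the port once in the parent process, before spawning workers,
# so that every rank inherits the same value.
# Assumption: the launcher reads MASTER_PORT rather than a hard-coded port.
# setdefault keeps any port the user has already exported.
os.environ.setdefault("MASTER_PORT", str(find_free_port()))
```

Note that this is not fully race-free: another process could grab the port between the moment the probe socket closes and the moment the rendezvous store binds to it. For a couple of jobs sharing a node, that window is probably small enough in practice.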