Closed kakou34 closed 1 year ago
Do single-GPU jobs work on your server?
Sometimes it can get stuck while building the package; I usually run build_pkg.py before I launch the jobs.
If it's still hanging, could you press Ctrl+C to kill the run and check the error message to see where the job is hanging?
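If Ctrl+C is awkward to use (for example when several worker processes are running), a rough alternative is Python's standard `faulthandler` module, which can print every thread's stack trace if the process is still running after a timeout. The helper names and the 120-second default below are made up for illustration; they are not part of this repo.

```python
import faulthandler
import sys


def enable_hang_diagnostics(timeout_s: float = 120.0, file=None) -> None:
    """Dump all thread stack traces every `timeout_s` seconds until cancelled.

    Hypothetical helper: call it early in each worker process, before the
    step that is suspected to hang.
    """
    out = file if file is not None else sys.stderr
    # Also print a traceback on fatal signals such as SIGSEGV.
    faulthandler.enable(file=out)
    # If the process is still alive after timeout_s, show where it is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disable_hang_diagnostics() -> None:
    """Stop the periodic dumps once initialization has succeeded."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_diagnostics()` at the top of the worker function and `disable_hang_diagnostics()` after initialization returns should show which call each rank is blocked in.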
Hey, I ran into the same problem you described above.
After checking the code, I found that it may be caused by the function `init_processes` in utils/utils.py.
I doubt it is necessary to bind the socket port manually, since it may cause a port conflict when multiple GPUs are used.
So I solved the problem by commenting out the following code:
```python
if args.num_proc_node == 1:
    import socket
    import errno
    a_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for p in range(6040, 6060):
        location = (args.master_address, p)  # "127.0.0.1", p)
        try:
            a_socket.bind((args.master_address, p))
            logger.debug('set port as {}', p)
            os.environ['MASTER_PORT'] = '%d' % p
            a_socket.close()
            break
        except socket.error as e:
            a = 0
            # if e.errno == errno.EADDRINUSE:
            #     # logger.debug("Port {} is already in use", p)
            # else:
            #     logger.debug(e)
```
Could you please explain why you have to bind a port for distributed training? It does not seem to be a routine operation. @ZENGXH
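To illustrate the suspected conflict, here is a minimal sketch of what can happen when each worker probes a port range on its own. If an earlier process is already listening on the first candidate port, a later worker's probe skips it and settles on a different port. That would be consistent with ranks reporting different "set port as" values, though this is only one possible explanation and has not been confirmed against the repo. `first_bindable_port` and `hold_port` are hypothetical names used only for this demonstration.

```python
import socket
from typing import Iterable, Optional


def first_bindable_port(host: str, ports: Iterable[int]) -> Optional[int]:
    """Return the first port in `ports` that can be bound right now, else None."""
    for port in ports:
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            probe.bind((host, port))
            return port
        except OSError:
            # Port is busy (or otherwise unavailable); try the next candidate.
            continue
        finally:
            probe.close()
    return None


def hold_port(host: str = "127.0.0.1") -> socket.socket:
    """Simulate a process that is already listening on an OS-chosen port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0 asks the OS for any free port
    listener.listen(1)
    return listener
```

While the socket returned by `hold_port()` is open, probing its port fails, so a worker that probes later than its peers can end up with a different `MASTER_PORT` and never meet them at the rendezvous.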
@yufeng9819 Thanks for figuring out this issue! I will update the code and comment out this block. It was originally meant to find an available port, since the default port is sometimes in use by other jobs.
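For anyone who still needs automatic port selection, one common pattern is to choose the port once in the parent process, before any workers are spawned, so that every rank inherits the same `MASTER_PORT`. The sketch below assumes workers are started as child processes of the launcher and inherit its environment; the function names are illustrative and not taken from this repo.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port on `host` that is unused at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


def set_master_port_once(host: str = "127.0.0.1") -> str:
    """Set MASTER_PORT only if it is not already set, and return its value.

    Intended to run in the launcher before spawning workers, so that all
    ranks see one shared value instead of probing independently.
    """
    if "MASTER_PORT" not in os.environ:
        os.environ["MASTER_PORT"] = str(find_free_port(host))
    return os.environ["MASTER_PORT"]
```

There is still a small window between releasing the probe socket and the real rendezvous in which another job could take the port, so this reduces conflicts rather than eliminating them.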
By the way, I want to ask whether the released checkpoints are available now. If so, could you please tell me how to download them? I cannot find a way to get them. @ZENGXH
It's not available yet. I am still working on getting the permission to release it.
Indeed this worked! thanks
Hi, thank you for the quick response and maintaining the amazing repo!
I have a server with 4 GPUs. I want to use all 4 of them, so I set $NGPU to 4 when running train_vae.sh. However, process initialization gets stuck. You can see my log below:
```
2023-03-16 22:26:34.640 | INFO | main:get_args:206 - EXP_ROOT: /LION/trials + exp name: 0316/colon/f14d9fh_hvae_lion_B8, save dir: /LION/trials/0316/colon/f14d9fh_hvae_lion_B8
2023-03-16 22:26:34.820 | INFO | main:get_args:211 - save config at /LION/trials/0316/colon/f14d9fh_hvae_lion_B8/cfg.yml
2023-03-16 22:26:34.821 | INFO | main:get_args:214 - log dir: /LION/trials/0316/colon/f14d9fh_hvae_lion_B8
2023-03-16 22:26:34.862 | INFO | main::228 - In Rank=0
2023-03-16 22:26:34.892 | INFO | main::234 - Node rank 0, local proc 0, global proc 0
2023-03-16 22:26:34.937 | INFO | main::228 - In Rank=1
2023-03-16 22:26:34.941 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:34.952 | INFO | main::234 - Node rank 0, local proc 1, global proc 1
2023-03-16 22:26:34.953 | INFO | utils.utils:init_processes:1152 - init_process: rank=0, world_size=4
2023-03-16 22:26:34.967 | INFO | main::228 - In Rank=2
2023-03-16 22:26:34.971 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:34.983 | INFO | main::234 - Node rank 0, local proc 2, global proc 2
2023-03-16 22:26:34.983 | INFO | utils.utils:init_processes:1152 - init_process: rank=1, world_size=4
2023-03-16 22:26:34.998 | INFO | main::228 - In Rank=3
2023-03-16 22:26:35.002 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:35.013 | INFO | main::234 - Node rank 0, local proc 3, global proc 3
2023-03-16 22:26:35.013 | INFO | utils.utils:init_processes:1152 - init_process: rank=2, world_size=4
2023-03-16 22:26:35.056 | INFO | main::242 - join 3
2023-03-16 22:26:35.060 | DEBUG | utils.utils:init_processes:1141 - set port as 6011
2023-03-16 22:26:35.073 | INFO | utils.utils:init_processes:1152 - init_process: rank=3, world_size=4
```
Nothing happens after this. I am using Docker. Do you have any idea how to solve this problem? Thank you in advance!