nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

Multiple GPU usage problem #32

Closed kakou34 closed 1 year ago

kakou34 commented 1 year ago

Hi, thank you for the quick response and maintaining the amazing repo!

I have a server with 4 GPUs. I want to use all 4 of them, so I set $NGPU to 4 when running train_vae.sh. However, process initialization gets stuck. You can see my log below:

```
2023-03-16 22:26:34.640 | INFO | main:get_args:206 - EXP_ROOT: /LION/trials + exp name: 0316/colon/f14d9fh_hvae_lion_B8, save dir: /LION/trials/0316/colon/f14d9fh_hvae_lion_B8
2023-03-16 22:26:34.820 | INFO | main:get_args:211 - save config at /LION/trials/0316/colon/f14d9fh_hvae_lion_B8/cfg.yml
2023-03-16 22:26:34.821 | INFO | main:get_args:214 - log dir: /LION/trials/0316/colon/f14d9fh_hvae_lion_B8
2023-03-16 22:26:34.862 | INFO | main::228 - In Rank=0
2023-03-16 22:26:34.892 | INFO | main::234 - Node rank 0, local proc 0, global proc 0
2023-03-16 22:26:34.937 | INFO | main::228 - In Rank=1
2023-03-16 22:26:34.941 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:34.952 | INFO | main::234 - Node rank 0, local proc 1, global proc 1
2023-03-16 22:26:34.953 | INFO | utils.utils:init_processes:1152 - init_process: rank=0, world_size=4
2023-03-16 22:26:34.967 | INFO | main::228 - In Rank=2
2023-03-16 22:26:34.971 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:34.983 | INFO | main::234 - Node rank 0, local proc 2, global proc 2
2023-03-16 22:26:34.983 | INFO | utils.utils:init_processes:1152 - init_process: rank=1, world_size=4
2023-03-16 22:26:34.998 | INFO | main::228 - In Rank=3
2023-03-16 22:26:35.002 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:35.013 | INFO | main::234 - Node rank 0, local proc 3, global proc 3
2023-03-16 22:26:35.013 | INFO | utils.utils:init_processes:1152 - init_process: rank=2, world_size=4
2023-03-16 22:26:35.056 | INFO | main::242 - join 3
2023-03-16 22:26:35.060 | DEBUG | utils.utils:init_processes:1141 - set port as 6011
2023-03-16 22:26:35.073 | INFO | utils.utils:init_processes:1152 - init_process: rank=3, world_size=4
```

Nothing happens after this. I am using Docker. Do you have an idea on how to solve this problem? thank you in advance!

ZENGXH commented 1 year ago

Do single-GPU jobs work on your server?

Sometimes it can get stuck while building the package: I usually run build_pkg.py before launching the jobs.

If it's still hanging, could you try Ctrl+C to kill the run and check from the error message where the job is hanging?
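As a quick sanity check independent of the LION codebase, a minimal script along the lines of the sketch below (the port and variable names are illustrative, not from the repo) can confirm whether `torch.distributed` with the NCCL backend initializes across all 4 GPUs inside the Docker container. If this also hangs, the problem is environment-level rather than specific to train_vae.sh.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int):
    # Each spawned process initializes NCCL and performs one all_reduce.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # any free port works
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # every rank should print world_size
    print(f"rank {rank}: all_reduce ok, value = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 4 on this server
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```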

yufeng9819 commented 1 year ago

Hey, I am hitting the same problem you mentioned above.

After checking the code, I found that it may be caused by the function `init_processes` in utils/utils.py.

I doubt it is necessary to bind the socket port manually, since it can cause port conflicts when multiple GPUs are used.

So I solved the problem by commenting out the following code:

```python
if args.num_proc_node == 1:
    import socket
    import errno
    a_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for p in range(6040, 6060):
        location = (args.master_address, p)  # "127.0.0.1", p)
        try:
            a_socket.bind((args.master_address, p))
            logger.debug('set port as {}', p)
            os.environ['MASTER_PORT'] = '%d' % p
            a_socket.close()
            break
        except socket.error as e:
            a = 0
            # if e.errno == errno.EADDRINUSE:
            #    # logger.debug("Port {} is already in use", p)
            # else:
            #    logger.debug(e)
```

Could you please explain why you have to bind the port for distributed training? It does not seem to be a routine operation. @ZENGXH

ZENGXH commented 1 year ago

@yufeng9819 Thanks for figuring out this issue! I will update the code and comment this block out: it was originally trying to find an available port, since sometimes the default port is already used by other jobs.
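If the port probing is kept, one way to avoid the race between ranks is sketched below. This is not the repo's current code; the assumption is that the free port is chosen once in the launching process, before the per-GPU workers are spawned, so every rank inherits the same `MASTER_PORT` instead of each rank binding ports on its own.

```python
import os
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS assigns any free TCP port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# Done once in the launching process, before spawning the per-GPU workers,
# so all ranks see the same MASTER_PORT and never race to bind ports.
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = str(find_free_port())
```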

yufeng9819 commented 1 year ago

By the way, I want to ask whether the released checkpoints are available now. If so, could you please tell me how to download them, because I cannot find a way to get them. @ZENGXH

ZENGXH commented 1 year ago

They are not available yet. I am still working on getting permission to release them.

kakou34 commented 1 year ago

> After checking the code, I found that it may be caused by the function `init_processes` in utils/utils.py. [...] So I solved the problem by commenting out the port-binding code quoted above.

Indeed, this worked! Thanks.