Why does the worker have an init container that waits for the master to be ready?

kubeflow / pytorch-operator

Why does the worker have an init container that waits for the master to be ready? #279

Open jiaqianjing opened 4 years ago

jiaqianjing commented 4 years ago

Why not set a large timeout in torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='') instead? What is the purpose of adding the init container?
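For reference, the alternative raised here could look roughly like the sketch below, which raises the timeout inside the training script itself. The helper names `build_init_kwargs` and `init_distributed`, the 120-minute default, and the `gloo` backend are assumptions for illustration; they are not taken from the operator or its examples. It assumes the `WORLD_SIZE`, `RANK`, `MASTER_ADDR`, and `MASTER_PORT` environment variables are set for the pod.

```python
import datetime
import os


def build_init_kwargs(timeout_minutes=120):
    """Collect keyword arguments for torch.distributed.init_process_group.

    The default of 120 minutes is an arbitrary illustrative value.
    """
    return {
        "backend": "gloo",  # assumption: CPU-friendly backend, chosen for illustration
        "init_method": "env://",  # reads MASTER_ADDR and MASTER_PORT from the environment
        "timeout": datetime.timedelta(minutes=timeout_minutes),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }


def init_distributed(timeout_minutes=120):
    """Initialize the process group with a larger-than-default timeout."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(**build_init_kwargs(timeout_minutes))
```

A training script would call `init_distributed()` once at startup, before building the model. The setting lives entirely in user code, so every script has to opt in.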

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label	Probability
kind/question	0.69

gaocegege commented 4 years ago

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

I think the timeout is a user-level config set in the training script. We cannot rely on it at the system level.
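To illustrate what a system-level wait means here, the sketch below polls until the master's address resolves and accepts a TCP connection, without touching the training script. It is a hypothetical stand-in written in Python; the function name `wait_for_master` and its parameters are assumptions, and the operator's actual init container may perform a different check.

```python
import socket
import time


def wait_for_master(host, port, retry_interval=2.0, max_wait=None):
    """Block until host:port accepts a TCP connection.

    Returns True once a connection succeeds, or False if max_wait seconds
    elapse first. With max_wait=None it retries indefinitely.
    """
    deadline = None if max_wait is None else time.monotonic() + max_wait
    while True:
        try:
            # create_connection resolves the name and connects; both DNS
            # failures and refused connections surface as OSError.
            with socket.create_connection((host, port), timeout=retry_interval):
                return True
        except OSError:
            pass
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(retry_interval)
```

For example, `wait_for_master(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))` (after `import os`) would block a worker until the master is reachable, whatever timeout the user's script passes to `init_process_group`.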

jiaqianjing commented 4 years ago

I agree, but that reason seems a little weak. Are there any other considerations?