kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Why does the worker have an init container that waits for the master to be ready? #279

Open jiaqianjing opened 4 years ago

jiaqianjing commented 4 years ago

Why not just set a large timeout in torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')? What is the purpose of adding this init container?
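For context, a minimal sketch of the alternative raised here: raising the timeout from inside the training script. The helper names `build_init_kwargs` and `init_distributed`, the `gloo` backend, and the `env://` init method are illustrative assumptions, not something stated in this thread. The default in the signature above, `datetime.timedelta(0, 1800)`, is 1800 seconds (30 minutes).

```python
import datetime


def build_init_kwargs(timeout_minutes=30):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper for illustration. The backend and init_method
    values are assumptions; a real job may use different ones.
    """
    return {
        "backend": "gloo",
        "init_method": "env://",
        "timeout": datetime.timedelta(minutes=timeout_minutes),
    }


def init_distributed(timeout_minutes=120):
    """Initialize the process group with a longer timeout (hypothetical helper)."""
    # Imported lazily so the module can be loaded without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(**build_init_kwargs(timeout_minutes))
```

A training script would call `init_distributed()` once at startup. Note that this setting lives entirely in user code, so every job author has to remember to set it.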

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/question 0.69


gaocegege commented 4 years ago

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

I think that timeout is a user-level setting, configured in the user's training code. We cannot rely on it at the system level.
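To illustrate the distinction, here is a rough sketch of a system-level wait that needs no cooperation from the training code. It assumes the wait only needs the master's hostname to become resolvable. This thread does not show the actual init container command, so the function name `wait_for_host` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_host(hostname, poll_seconds=2.0, max_wait_seconds=None):
    """Block until `hostname` resolves, polling every `poll_seconds`.

    Returns True once the name resolves, or False if `max_wait_seconds`
    elapses first. With max_wait_seconds=None it waits indefinitely.
    Hypothetical helper; not the operator's actual implementation.
    """
    deadline = None
    if max_wait_seconds is not None:
        deadline = time.monotonic() + max_wait_seconds
    while True:
        try:
            socket.gethostbyname(hostname)
            return True
        except OSError:
            # Resolution failed (socket.gaierror is a subclass of OSError).
            pass
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Because a wait like this runs before the user's container starts, it behaves the same way regardless of what timeout the training script passes to `init_process_group`.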

jiaqianjing commented 4 years ago

I agree, but that reason seems a little weak. Are there any other considerations?