microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Barrier for all tasks #5244

Open ultmaster opened 3 years ago

ultmaster commented 3 years ago

Is there a recommended practice for waiting until all tasks have finished their preparation? In other words, a way to insert a barrier.

I've implemented one with PyTorch (for a server-client scenario), though I suspect there are better options, for example something natively supported by the pai runtime.

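#!/usr/bin/env python3
import functools
import os

import torch

# Flush every print so messages show up in the job log immediately.
print = functools.partial(print, flush=True)

# The first server task hosts the rendezvous; 'torch' is the label of its port.
master_addr = os.environ['PAI_HOST_IP_server_0']
master_port = os.environ['PAI_PORT_LIST_server_0_torch']
# World size: all client tasks plus the single server.
task_count = int(os.environ['PAI_TASK_ROLE_TASK_COUNT_client']) + 1
# The server is rank 0; clients take ranks 1..N by task index.
if os.environ['PAI_CURRENT_TASK_ROLE_NAME'] == 'server':
    task_rank = 0
else:
    task_rank = int(os.environ['PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX']) + 1
# Joining the process group blocks until all task_count ranks have connected.
torch.distributed.init_process_group('gloo', init_method=f'tcp://{master_addr}:{master_port}',
                                     world_size=task_count, rank=task_rank)
print('PAI barrier: waiting for other workers...')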
torch.distributed.barrier()

abuccts commented 3 years ago

Hi, you could take a look at this sshbarrier config, which starts all tasks at roughly the same time: https://github.com/microsoft/pai/blob/9d7c1aca76269d61e36ab46feca1d667a64154e1/marketplace-v2/horovod-pytorch-synthetic-benchmark.yaml#L67-L72

ultmaster commented 3 years ago

@abuccts Thanks for the reply. What I want, though, is a barrier in the middle of my task, for example after my data download is complete.
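
To make the mid-task case concrete, here is a minimal sketch of how the script above could be wrapped so that the barrier runs right after a preparation step. It is not something the pai runtime provides. The helper names `pai_rendezvous` and `prepare_then_barrier` and the `prepare` callback are hypothetical, and it assumes the same layout as my script: one `server` task, any number of `client` tasks, and a port labelled `torch` on the server.

```python
import datetime
import os
from typing import Callable, Mapping, Tuple


def pai_rendezvous(env: Mapping[str, str]) -> Tuple[str, int, int]:
    """Return (init_method, world_size, rank) from OpenPAI environment variables.

    Assumption: one 'server' task plus N 'client' tasks, with a port
    labelled 'torch' on the server, as in the script above.
    """
    addr = env['PAI_HOST_IP_server_0']
    port = env['PAI_PORT_LIST_server_0_torch']
    world_size = int(env['PAI_TASK_ROLE_TASK_COUNT_client']) + 1
    if env['PAI_CURRENT_TASK_ROLE_NAME'] == 'server':
        rank = 0  # the server is always rank 0
    else:
        rank = int(env['PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX']) + 1
    return f'tcp://{addr}:{port}', world_size, rank


def prepare_then_barrier(prepare: Callable[[], None],
                         env: Mapping[str, str] = os.environ,
                         timeout_minutes: int = 60) -> None:
    """Run `prepare` (e.g. a data download), then wait for every other task."""
    # Imported lazily so that pai_rendezvous can be used without PyTorch.
    import torch.distributed as dist

    prepare()
    init_method, world_size, rank = pai_rendezvous(env)
    # Joining the group already blocks until all ranks connect, so the
    # timeout has to cover the slowest task's preparation time.
    dist.init_process_group('gloo', init_method=init_method,
                            world_size=world_size, rank=rank,
                            timeout=datetime.timedelta(minutes=timeout_minutes))
    dist.barrier()
    # Release the rendezvous port so later training code can reuse it.
    dist.destroy_process_group()
```

Each task would call something like `prepare_then_barrier(download_data)`, where `download_data` is a placeholder for whatever preparation the task needs, and then carry on with the real work. The 60-minute default timeout is an arbitrary choice and should be set to fit the slowest download.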