tsujuifu / pytorch_empirical-mvm

A PyTorch implementation of EmpiricalMVM

Distributed Initialization Fails When Pretraining with Multiple GPUs #3

Closed fzohra closed 1 year ago

fzohra commented 1 year ago

I'm able to start training successfully on a single-node, single-GPU setup, but it fails when I increase the number of GPUs.

For example, on an A100 with 2 GPUs, if I run the following with deepspeed enabled:

CUDA_VISIBLE_DEVICES='0,1' python -m torch.distributed.launch --nproc_per_node=2 --master_port=5566 main_pretrain_yaml.py --config _args/args_pretrain.json

I can see that both GPUs (ranks 0 and 1) seem to initialize distributed training, but while rank 0 continues to run as expected, rank 1 becomes unresponsive. Furthermore, it appears that only one CPU process starts, and it is pinned to one of the GPUs.

Here's a snippet from the logs:

INFO - __main__ -   Init distributed training on local rank 0
INFO - __main__ -   Init distributed training on local rank 1
INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 1
INFO - torch.distributed.distributed_c10d -   Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 0
INFO - torch.distributed.distributed_c10d -   Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
...

[INFO] [comm.py:594:init_distributed] cdb=None
INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:2 to store for rank: 0
INFO - torch.distributed.distributed_c10d -   Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=2, worker_count=1, timeout=0:30:00)
INFO - torch.distributed.distributed_c10d -   Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=2, worker_count=1, timeout=0:30:00)
INFO - torch.distributed.distributed_c10d -   Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=2, worker_count=1, timeout=0:30:00)
...

This issue arises when attempting to distribute the workload across multiple data files (cc3m/webvid2.5m_train_0.caption.tsv to cc3m/webvid2.5m_train_9.caption.tsv), but not when using a single file (cc3m/webvid2.5m_train_0.caption.tsv), so it seems like the problem may be in the CPU-side data loading/handling of the files. I have tried increasing the number of workers without success.

Note that this occurs at the call to self.model, self.optzr, _, _ = deepspeed.initialize(config_params=config, model=self.model, optimizer=self.optzr, lr_scheduler=self.lr_scheduler)

Similarly, when DeepSpeed is not enabled, it occurs at self.model = T.nn.parallel.DistributedDataParallel(self.model, device_ids=[get_local_rank()], output_device=get_local_rank(), find_unused_parameters=True)
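For context, here is a minimal sketch of the non-DeepSpeed initialization path being described, assuming the standard torch.distributed.launch setup (the model constructor and the way the local rank is read are illustrative, not the repo's exact code):

import os
import torch as T
import torch.distributed as dist

# torch.distributed.launch spawns one process per GPU; the local rank is read
# from the environment here for illustration (the repo uses a get_local_rank() helper).
local_rank = int(os.environ.get('LOCAL_RANK', 0))
T.cuda.set_device(local_rank)

# Every rank must reach this call; otherwise the store-based barrier seen in the
# logs above keeps waiting until it times out.
dist.init_process_group(backend='nccl', init_method='env://')

model = build_model().cuda()  # hypothetical model constructor
model = T.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank,
    find_unused_parameters=True)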

Please help, thanks!

fzohra commented 1 year ago

It turns out this issue was a result of os.environ['LOCAL_SIZE'] not being set, so the local_size variable used by the NodeSplitSampler was incorrectly defaulting to 1. LOCAL_SIZE doesn't appear to be a standard environment variable used by PyTorch's distributed package, is it?

Setting LOCAL_SIZE to the number of GPUs on a single node resolves the issue.
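For illustration, the defaulting behavior presumably looks something like this (a sketch, not the repo's exact code):

import os

# If the launcher does not export LOCAL_SIZE, the sampler falls back to 1,
# so every process believes it is the only GPU on the node.
local_size = int(os.environ.get('LOCAL_SIZE', 1))

# Setting it to the number of GPUs per node before launching avoids the mismatch,
# e.g. os.environ['LOCAL_SIZE'] = '2' for the two-GPU run above.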

In that case, given 8 GPUs on a single node, are these NodeSplitSampler values expected:

world_size = 8
local_size = 8
node_size = 1
node_idx = 0
rank = GPU rank (0-7)
local_rank = GPU rank (0-7)

If so, am I right in understanding that the expected behavior of get_index_on_node is to assign all 10 of the composite files (0-9) for each dataset to the single node, and the expected behavior of get_index_on_rank is to split the full dataset's indices between the 8 GPUs?
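In other words, something like the following two-level split (the bodies and signatures below are only illustrative, not the repo's actual implementation):

def get_index_on_node(num_files, node_size, node_idx):
    # with node_size=1 and node_idx=0, all 10 files (0-9) land on the single node
    return [f for f in range(num_files) if f % node_size == node_idx]

def get_index_on_rank(node_indices, local_size, local_rank):
    # each of the 8 local ranks takes an interleaved slice of the node's sample indices
    return node_indices[local_rank::local_size]

files_on_node = get_index_on_node(10, node_size=1, node_idx=0)   # [0, 1, ..., 9]
shard_for_rank3 = get_index_on_rank(list(range(80)), 8, 3)       # [3, 11, 19, ...]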

tsujuifu commented 1 year ago

Thanks for pointing this out!

The distributed variables will be set automatically in our training environment, so I missed that part in this release code 😅.

tsujuifu commented 1 year ago

It should work after adding LOCAL_SIZE='8'. I have also updated the command in the README.
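For an 8-GPU single node, the launch command from above would presumably become:

LOCAL_SIZE='8' CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_yaml.py --config _args/args_pretrain.json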

Super thanks again for your clarification 😍