Turns out this issue was caused by `os.environ['LOCAL_SIZE']` not being set, so the `local_size` variable used by the `NodeSplitSampler` was incorrectly defaulting to 1. `LOCAL_SIZE` doesn't appear to be a standard environment variable used by PyTorch's distributed package, is it?
Setting `LOCAL_SIZE` to the number of GPUs on a single node resolves the issue.
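In case anyone hits the same thing, here's the minimal workaround I'm using (my own sketch, not code from this repo): populate `LOCAL_SIZE` with the per-node GPU count before the sampler reads it, using `torch.cuda.device_count()` as the count.

```python
import os
import torch

# Workaround sketch (not from this repo): set LOCAL_SIZE to the per-node GPU
# count before NodeSplitSampler reads it, so local_size no longer silently
# defaults to 1.
if 'LOCAL_SIZE' not in os.environ:
    os.environ['LOCAL_SIZE'] = str(max(torch.cuda.device_count(), 1))
```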
In which case, given 8 GPUs on a single node, are these the expected values for the variables in the `NodeSplitSampler`?

- `world_size=8`
- `local_size=8`
- `node_size=1`
- `node_idx=0`
- `rank` = GPU rank (0-7)
- `local_rank` = GPU rank (0-7)
If so, am I right in understanding that the expected behavior of `get_index_on_node` is to assign all 10 composite files (0-9) of each dataset to the single node, and that the expected behavior of `get_index_on_rank` is to split the full set of dataset indices across the 8 GPUs? That's roughly what I sketch below.
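Here's a rough illustration of the two-stage split I have in mind (my own sketch, not the actual `NodeSplitSampler` code; the real method names and splitting logic may differ, e.g. contiguous chunks instead of striding):

```python
# Illustration only: a round-robin version of the node/rank split described above.
num_files = 10                 # composite files 0-9 for one dataset
node_size, node_idx = 1, 0     # single node
local_size = 8                 # GPUs per node
rank = 3                       # example GPU rank in [0, 7]
local_rank = rank % local_size

# "get_index_on_node": with node_size=1, every composite file belongs to this node.
files_on_node = [f for f in range(num_files) if f % node_size == node_idx]
assert files_on_node == list(range(num_files))

# "get_index_on_rank": the node's sample indices are then divided among the
# 8 GPUs, here by striding over them.
num_samples_on_node = 1_000    # placeholder count
indices_on_rank = list(range(local_rank, num_samples_on_node, local_size))
```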
Thanks for pointing this out!
The distributed variables will be set automatically in our training environment, so I missed that part in this release code 😅.
I'm able to successfully start training on a single-node, single-GPU setup, but it fails when I increase the number of GPUs.
For example, on a node with 2 A100 GPUs, if I run the following with DeepSpeed enabled:

```
CUDA_VISIBLE_DEVICES='0,1' python -m torch.distributed.launch --nproc_per_node=2 --master_port=5566 main_pretrain_yaml.py --config _args/args_pretrain.json
```
I can see that both GPUs (ranks 0 and 1) appear to initialize distributed training, but while rank 0 continues to run as expected, rank 1 becomes unresponsive. Furthermore, it appears that only one CPU process starts, and it is pinned to one of the GPUs.
Here's a snippet from the logs:
This issue arises when distributing the workload across multiple data files (`cc3m/webvid2.5m_train_0.caption.tsv` through `cc3m/webvid2.5m_train_9.caption.tsv`) but not when using a single file (`cc3m/webvid2.5m_train_0.caption.tsv`), so it seems the problem may be in the CPU-side data loading/handling of the files. I have tried increasing the number of workers without success.
Note that this occurs in the code when making the call to:

```python
self.model, self.optzr, _, _ = deepspeed.initialize(config_params=config, model=self.model, optimizer=self.optzr, lr_scheduler=self.lr_scheduler)
```
And similarly, when DeepSpeed is not enabled, at:

```python
self.model = T.nn.parallel.DistributedDataParallel(self.model, device_ids=[get_local_rank()], output_device=get_local_rank(), find_unused_parameters=True)
```
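In case it helps narrow things down, here's the minimal diagnostic I've been running just before the wrapping call (my own sketch, not code from this repo); `LOCAL_RANK` is the variable set by `torch.distributed.launch`:

```python
import os
import torch
import torch.distributed as dist

# Diagnostic sketch (not from this repo): run immediately before
# deepspeed.initialize or DistributedDataParallel to check whether every rank
# gets this far. If one rank never prints "passed barrier", it is stalling
# earlier (e.g. in data loading) rather than inside the wrapping call itself.
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)
if not dist.is_initialized():
    dist.init_process_group(backend='nccl')
print(f'[rank {dist.get_rank()}] reached pre-wrap barrier', flush=True)
dist.barrier()
print(f'[rank {dist.get_rank()}] passed barrier', flush=True)
```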
Please help, thanks!