msr-fiddle / pipedream

same train_loader but got different loader size #80

Closed Hyaloid closed 11 months ago

Hyaloid commented 11 months ago

Hi @deepakn94, I was running alexnet.gpus=4_straight with mp_conf.json and noticed that the ranks report different values of len(train_loader).

train_loader is defined in main_with_runtime.py:

if configuration_maps['stage_to_rank_map'] is not None:
    num_ranks_in_first_stage = len(configuration_maps['stage_to_rank_map'][0])
    if num_ranks_in_first_stage > 1:
        train_sampler = torch.utils.data.distributed.DistributedSampler(
            train_dataset, num_replicas=num_ranks_in_first_stage,
            rank=args.rank)
        val_sampler = torch.utils.data.distributed.DistributedSampler(
            val_dataset, num_replicas=num_ranks_in_first_stage,
            rank=args.rank)
        distributed_sampler = True

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler, drop_last=True)

batch_size is the same on every rank, so from this definition of train_loader I would expect each rank to see the same len(train_loader), but the values are actually different. Do you know why this happens?
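For reference, here is a minimal standalone sketch of how len(train_loader) is determined (the dataset sizes below are made up for illustration). With drop_last=True, the loader length is len(dataset) // batch_size, or len(sampler) // batch_size when a sampler is supplied, and a DistributedSampler shard holds roughly ceil(len(dataset) / num_replicas) samples, so equal batch sizes only give equal loader lengths when the per-rank dataset (or shard) sizes also match.

import math
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

batch_size = 256

# Two toy datasets of different lengths (sizes are illustrative only).
small = TensorDataset(torch.zeros(1_000_000, 1))
large = TensorDataset(torch.zeros(1_281_167, 1))

for name, dataset in [("small", small), ("large", large)]:
    loader = DataLoader(dataset, batch_size=batch_size, drop_last=True)
    # drop_last=True => len(loader) == len(dataset) // batch_size
    print(name, len(dataset), len(loader))

# The same rule applies per rank when a DistributedSampler shards the dataset.
sampler = DistributedSampler(large, num_replicas=2, rank=0)
sharded = DataLoader(large, batch_size=batch_size, sampler=sampler, drop_last=True)
assert len(sharded) == math.ceil(len(large) / 2) // batch_size
print("large, sharded over 2 ranks:", len(sharded))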

And when training with these mismatched len(train_loader) values, I got this error after all 250000 iterations finished:

Exception in thread Thread-16:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 632, in send_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 709, in _send
    dist.send(tensor=tensor_shape, dst=dst_rank, tag=tag)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 608, in send
    default_pg.send([tensor], dst, tag).wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:119] Timed out waiting 180000ms for send operation to complete

Perhaps this is caused by the issue mentioned above? Any help would be much appreciated!
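For what it's worth, this kind of hang can be reproduced in isolation when two ranks run a different number of send/recv iterations. The sketch below is not PipeDream code: the port, tags, and the 5-second timeout are invented for the demo, and I'm assuming the process-group timeout is what bounds gloo's blocking send (the 180000ms in the error message suggests a configurable timeout).

import datetime
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
        timeout=datetime.timedelta(seconds=5))  # short timeout just for the demo
    # Mismatched loop lengths: rank 0 sends 6 tensors, rank 1 receives only 5.
    num_iterations = 6 if rank == 0 else 5
    tensor = torch.zeros(1)
    for i in range(num_iterations):
        if rank == 0:
            dist.send(tensor, dst=1, tag=i)  # the 6th send has no matching recv
        else:
            dist.recv(tensor, src=0, tag=i)
    if rank == 1:
        time.sleep(15)  # stay alive so rank 0's unmatched send times out cleanly
    dist.destroy_process_group()

if __name__ == "__main__":
    # Rank 0 eventually fails with "Timed out ... for send operation to complete".
    mp.spawn(worker, args=(2,), nprocs=2)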

Hyaloid commented 11 months ago

Every stage except the first uses SyntheticDataset((3, 224, 224), 1000000), so the dataset sizes are different.
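For context, a synthetic dataset of this kind simply reports a fixed, hard-coded length; an illustrative stand-in (not the exact class from main_with_runtime.py) might look like this:

import torch
from torch.utils.data import Dataset

class SyntheticDataset(Dataset):
    """Illustrative stand-in: one cached random sample, fixed reported length."""
    def __init__(self, input_size, length, num_classes=1000):
        self.tensor = torch.rand(*input_size)
        self.target = torch.randint(0, num_classes, (1,)).item()
        self.length = length

    def __getitem__(self, index):
        return self.tensor, self.target

    def __len__(self):
        return self.length

# A stage built with SyntheticDataset((3, 224, 224), 1000000) reports 1000000
# samples regardless of the real dataset's size, so its len(train_loader)
# differs from the first stage's.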

Hyaloid commented 11 months ago

To get the same dataset on every rank, you can comment out these two lines in main_with_runtime.py:

if not is_first_stage():
    args.synthetic_data = True

Commenting out these two lines gives every stage the same dataset; alternatively, use -s on the command line so that every stage runs on synthetic data, to test whether PipeDream works.