
Dataloader Rerunning with num_workers=0 may give better error trace #58847

Open MarsSu0618 opened 3 years ago

MarsSu0618 commented 3 years ago

❓ Questions and Help

Hi, everyone. When I use DDP I have run into a problem. I want to run training on a single node with 4 GPUs on GCP. If I set num_workers=0 it works, but training is slow and I want to speed it up. Whenever I set num_workers>0, however, I always get the following error message.

Error message

RuntimeError: DataLoader worker exited unexpectedly with exit code 1. 
Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace

code

import torch
from absl import app

def launch_training_job(local_rank,
                        processed_dataset):
    ### ddp ###
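    # Assumption: no init_method is passed, so init_process_group falls back to the
    # default env:// initialization, which requires MASTER_ADDR and MASTER_PORT to be
    # set in the environment before this call.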
    torch.distributed.init_process_group(backend='nccl',
                                         world_size=4,
                                         rank=local_rank)
    torch.cuda.set_device(local_rank)
    print('[INFO] Starting nccl for ddp.')

    distributed_sampler = torch.utils.data.distributed.DistributedSampler(processed_dataset)

    processed_sms_dataloader = torch.utils.data.DataLoader(processed_dataset,
                                                           batch_size=32,
                                                           pin_memory=True,
                                                           num_workers=2,
                                                           sampler=distributed_sampler)

def main(argv):
    .......
    num_gpus = 4
    torch.multiprocessing.spawn(launch_training_job,
                                args=(processed_dataset,),  # args must be a tuple; note the trailing comma
                                nprocs=num_gpus)
if __name__ == "__main__":
    app.run(main)
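
As the error message suggests, a quick way to surface the real worker exception is to iterate the same loader with num_workers=0, so the dataset code runs in the main process. A minimal debugging sketch, reusing the names from the snippet above:

# Debug sketch (assumes processed_dataset and distributed_sampler from the code above):
# with num_workers=0 the dataset code runs in the main process, so the underlying
# exception is raised with its full traceback instead of "worker exited unexpectedly".
debug_loader = torch.utils.data.DataLoader(processed_dataset,
                                           batch_size=32,
                                           pin_memory=True,
                                           num_workers=0,
                                           sampler=distributed_sampler)
for batch in debug_loader:
    pass  # any error raised inside the dataset will now show its original traceback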

Environment

Hope someone can help; I would appreciate it.

@SsnL I found that you answered a similar issue; could you help me? Thank you.

cc @SsnL @VitalyFedyunin @ejguan @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

wanchaol commented 3 years ago

It seems like the DataLoader sometimes doesn't play nicely with DDP. Do you have a smaller repro script? i.e., with a simple model and dataset it should not fail, I think.

@VitalyFedyunin have you seen anything similar before, and do you happen to know what's going on there?
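
For reference, here is a sketch of the kind of smaller repro being asked for, with a synthetic TensorDataset and a trivial model. All names are made up for illustration, and it assumes a single node where MASTER_ADDR/MASTER_PORT can point at localhost:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def run(rank, world_size):
    # Single-node defaults so env:// initialization can find the rendezvous point.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Synthetic dataset and a tiny model, just to exercise DataLoader workers under DDP.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=2, pin_memory=True)

    model = torch.nn.parallel.DistributedDataParallel(
        torch.nn.Linear(16, 2).cuda(rank), device_ids=[rank])
    for x, y in loader:
        loss = torch.nn.functional.cross_entropy(model(x.cuda(rank)), y.cuda(rank))
        loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)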

ejguan commented 3 years ago

I am not familiar with DDP, but can you try using a single DataLoader instance with multiple workers together with DistributedSampler and DDP? When you set num_workers > 0, several worker processes are spawned or forked per DataLoader, and too many processes can also create a performance bottleneck.
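
To make the process-count point concrete: each DDP rank launched by torch.multiprocessing.spawn creates its own DataLoader, so the total number of loader worker processes is nprocs * num_workers. A rough back-of-the-envelope sketch (the capping heuristic here is an assumption, not an official recommendation):

import os

world_size = 4                      # DDP processes on this node
cpus = os.cpu_count() or 1
# Cap workers so that ranks plus their workers don't oversubscribe the CPUs.
num_workers = max(1, (cpus - world_size) // world_size)
print(f"{world_size} ranks x {num_workers} workers = "
      f"{world_size * num_workers} loader workers on {cpus} CPUs")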