megvii-research / MSPN

Multi-Stage Pose Network

Error while starting training in single GPU setting #24

Closed: muditbac-curefit closed this issue 4 years ago

muditbac-curefit commented 4 years ago

https://github.com/megvii-detection/MSPN/blob/a84f750aaa34e32ded49c44dda6e73a6538c4fde/cvpack/torch_modeling/engine/engine.py#L56

The variable engine.local_rank is not set when training with a single GPU. I have fixed this by also setting local_rank in the non-distributed branch, like this:

        if self.distributed:
            # Ranks and world size come from the environment variables
            # set by the distributed launcher (e.g. torch.distributed.launch)
            self.local_rank = self.args.local_rank
            self.world_size = int(os.environ['WORLD_SIZE'])
            self.world_rank = int(os.environ['RANK'])
            torch.cuda.set_device(self.local_rank)
            dist.init_process_group(backend="nccl", init_method='env://')
            dist.barrier()
            self.devices = [i for i in range(self.world_size)]
        else:
            # todo check non-distributed training
            # Single-process fallback: still set local_rank so later code
            # that reads engine.local_rank does not fail
            self.local_rank = self.args.local_rank
            self.world_rank = 1
            self.devices = parse_torch_devices(self.args.devices)

Can you please let me know if this is the right way to do it?

FYI, without this change I get an AttributeError saying the engine has no attribute named local_rank.
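The branching above can be sketched in isolation, without GPUs or torch.distributed, to show why the non-distributed branch must still assign local_rank. The class below is a hypothetical stand-in for the MSPN Engine, not the actual implementation; attribute names mirror the snippet above, and the single-process defaults are an assumption.

```python
import os

class EngineState:
    """Hypothetical sketch of the rank-setup logic from the issue.

    In distributed mode, ranks come from the environment variables set by
    the distributed launcher. In single-GPU mode, local_rank must still be
    assigned, otherwise later reads of engine.local_rank raise
    AttributeError (the bug reported in this issue).
    """

    def __init__(self, distributed, local_rank=0, devices="0"):
        self.distributed = distributed
        if self.distributed:
            # Launcher-provided ranks (WORLD_SIZE/RANK set by the launcher).
            self.local_rank = local_rank
            self.world_size = int(os.environ["WORLD_SIZE"])
            self.world_rank = int(os.environ["RANK"])
            self.devices = list(range(self.world_size))
        else:
            # Single-process fallback: assign local_rank explicitly so the
            # attribute always exists; parse the device list from a string.
            self.local_rank = local_rank
            self.world_size = 1
            self.devices = [int(d) for d in devices.split(",")]
```

With this fallback, `EngineState(distributed=False).local_rank` is defined (0 by default) instead of raising AttributeError.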

fenglinglwb commented 4 years ago

Thanks for pointing this out. Your solution looks right.


muditbac-curefit commented 4 years ago

Closing this issue; I have created a pull request: https://github.com/megvii-detection/MSPN/pull/25