intel / torch-ccl

oneCCL Bindings for Pytorch*
BSD 3-Clause "New" or "Revised" License

ProcessGroupCCL Destructor Not Correctly Called in PT 1.10 #35

Open Zha0q1 opened 2 years ago

Zha0q1 commented 2 years ago

Hi torch-ccl community,

I was trying to run the following code with PT 1.10 and the ccl backend:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch_ccl  # imported for its side effect: registers the "ccl" backend
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the launcher provides rank and world size through the environment.
dist.init_process_group(backend="ccl")


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))


model = ToyModel()
# The destructor problem shows up only with find_unused_parameters=True.
ddp = DDP(model, find_unused_parameters=True)

inp = torch.randn(1, 10)
out = ddp(inp)

With find_unused_parameters=True, the destructor of ProcessGroupCCL was not called. With find_unused_parameters=False there was no issue. In most cases this would be harmless, because the destructor is empty anyway (https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111). However, I am building an extension that needs to release resources in ~ProcessGroupCCL(), and if ~ProcessGroupCCL() is not called, the process hangs on exit. The issue does not occur with PT 1.9, so it looks like an object lifetime management problem in PyTorch.
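One way to make cleanup independent of when (or whether) a destructor runs is to release resources through an explicit, idempotent call and register that call as an exit hook. The following is only a minimal sketch of that pattern in plain Python. `ExtensionResources` and `release()` are hypothetical names, not part of torch-ccl or PyTorch, and the real extension described above lives in C++.

```python
import atexit


class ExtensionResources:
    """Hypothetical stand-in for whatever the extension frees in ~ProcessGroupCCL()."""

    def __init__(self):
        self.released = False

    def release(self):
        # Idempotent, so it is safe to call explicitly and again from the exit hook.
        if not self.released:
            # A real extension would free its handles, threads, etc. here.
            self.released = True


resources = ExtensionResources()

# Fallback: runs at normal interpreter shutdown even if no destructor fires.
atexit.register(resources.release)

# ... training code would run here ...

# Preferred: release explicitly once the work is done.
resources.release()
```

In a real script the analogous explicit step might be a call such as torch.distributed.destroy_process_group() before exit, but whether that drops the last reference to the ProcessGroupCCL object in this scenario is not established in this thread.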

Would appreciate any insights and help!

chengjunlu commented 2 years ago

Hi @Zha0q1,

I cannot reproduce the issue "the destructor of ProcessGroupCCL was not correctly called". For me, ~ProcessGroupCCL is always called at the end of the Python process's life, for both find_unused_parameters=True and find_unused_parameters=False.

There may be some ordering requirements in the exit cleanup of your code.

Please be aware that the destructor of ProcessGroup is called when the references held by Python objects are cleaned up at the end of the Python process's life.
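The point about references can be illustrated without PyTorch. This is a rough sketch only: `FakeProcessGroup` is a hypothetical stand-in for the Python wrapper of a process group, and the behaviour shown (finalization as soon as the last reference is dropped) is that of CPython's reference counting.

```python
import weakref


class FakeProcessGroup:
    """Hypothetical stand-in for the Python-side wrapper of a process group."""


events = []

pg = FakeProcessGroup()
# The callback fires when the object is actually destroyed.
weakref.finalize(pg, events.append, "process group finalized")

# A second owner, similar in spirit to a module-level default group.
holder = {"pg": pg}

del pg
# One reference is still held by `holder`, so nothing has been finalized yet.
alive_after_first_del = len(events) == 0

# Dropping the last reference lets CPython destroy the object immediately.
holder.clear()
print(events)
```

Any owner that is never released, on either the Python or the C++ side, keeps the object alive and therefore keeps its destructor from running.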

Zha0q1 commented 2 years ago

Hi @chengjunlu, thanks for your reply! Would you share the hardware and software stack you used? This issue only occurred with PT 1.10 for me -- PT 1.9 worked just fine. I was using an AWS P4d instance with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-cpu-py38-ubuntu20.04-sagemaker as the base image.

chengjunlu commented 2 years ago

I am using the public pytorch v1.10.0-rc3 tag for the 1.10 release.

Would you help to double check whether this issue could be reproduced without your changes?

Zha0q1 commented 2 years ago

Hi, I used the v1.10.0 tag and built PyTorch from source. And yes, the issue is still reproducible even with this branch: https://github.com/intel/torch-ccl/tree/ccl_torch1.10. I only added a std::cout in the destructor to show whether it was called.

chengjunlu commented 2 years ago

Let's try some more experiments:

  1. Add some debug information in the destructor of ProcessGroup.
  2. Can you show the C++ ABI of the PyTorch build on your platform, i.e. torch._C._GLIBCXX_USE_CXX11_ABI?
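
The ABI check in step 2 can be collected together with the version information, so that two environments are easy to compare. A small sketch, assuming only that PyTorch may or may not be importable; `describe_torch_build` is a hypothetical helper name, and `getattr` is used so that a missing attribute shows up as None instead of raising.

```python
def describe_torch_build():
    """Collect build facts worth comparing between two machines."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    return {
        "torch": torch.__version__,
        # True when the build uses the libstdc++ C++11 ABI.
        "cxx11_abi": getattr(torch._C, "_GLIBCXX_USE_CXX11_ABI", None),
        # Commit the binary was built from, if recorded.
        "git_version": getattr(torch.version, "git_version", None),
    }


build_info = describe_torch_build()
print(build_info)
```

Posting this output from both machines would show directly whether the two builds differ in version, commit, or ABI.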

Zha0q1 commented 2 years ago

  1. Do you mean the PyTorch ProcessGroup?
  2. It shows True.

One more question: did you try the same script I used?

chengjunlu commented 2 years ago

  1. Do you mean the PyTorch ProcessGroup? Yes.
  2. Did you try the same script I used? Yes.

Zha0q1 commented 2 years ago

Sure I will do more experiments on Monday. Do you have any insights as to what might be the issue?

chengjunlu commented 2 years ago

Sure I will do more experiments on Monday. Do you have any insights as to what might be the issue?

It is a bizarre issue, and I am not confident about the root cause. The hard part is that I cannot reproduce it on my platform.

Here are just some points we can look into:

The process group in PT 1.10 is managed by an intrusive pointer. A known drawback of reference-counted smart pointers in C++ is that circular references prevent the objects involved from being destroyed. The reducer attribute of DistributedDataParallel (the Reducer) keeps a reference to the process group (in this test, the ProcessGroupCCL object). Another attribute, _default_pg, also keeps a reference to it. But neither of them keeps a reference back to the other, so there is no obvious cycle. We need to investigate further.
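
The smart-pointer drawback mentioned here can be imitated in Python as an analogy only. In the sketch below, CPython's cycle collector is switched off so that only reference counting remains, which is roughly how a C++ reference-counted pointer behaves. `Node` and `finalized_without_cycle_collector` are hypothetical names, and nothing here shows that such a cycle actually exists in PyTorch 1.10.

```python
import gc
import weakref


class Node:
    """Hypothetical object that holds a strong or weak link to a peer."""

    def __init__(self, name):
        self.name = name
        self.peer = None


def finalized_without_cycle_collector(make_link):
    """Return True if both nodes are destroyed by reference counting alone."""
    log = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # mimic pure reference counting, as with C++ smart pointers
    try:
        a, b = Node("a"), Node("b")
        weakref.finalize(a, log.append, "a")
        weakref.finalize(b, log.append, "b")
        a.peer = make_link(b)
        b.peer = make_link(a)
        del a, b  # drop the only external references
        result = sorted(log) == ["a", "b"]
    finally:
        if gc_was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind by the experiment
    return result


# Strong links in both directions form a cycle that refcounting cannot free.
strong_cycle_freed = finalized_without_cycle_collector(lambda obj: obj)

# Making the links weak breaks the cycle, so both objects are freed.
weak_link_freed = finalized_without_cycle_collector(weakref.ref)

print(strong_cycle_freed, weak_link_freed)
```

The usual C++ remedy is the same as in the second call: one side of the relationship holds a weak (non-owning) reference so that the counts can reach zero.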

Another aspect we can check is the pybind layer itself. That is less likely, but worth ruling out.