Open Zha0q1 opened 2 years ago
Hi @Zha0q1,
I cannot reproduce the issue where "the destructor of ProcessGroupCCL was not correctly called".
~ProcessGroupCCL is always called at the end of the Python process's life, for both find_unused_parameters=True and find_unused_parameters=False.
There may be some requirements on the order of the cleanup at exit in your code.
Please be aware that the destructor of ProcessGroup is called when the references held by Python objects are cleaned up at the end of the Python process's life.
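As a rough illustration of that point, here is a pure-Python sketch, assuming CPython reference counting. FakeProcessGroup and Holder are hypothetical stand-ins, not PyTorch or torch-ccl classes. The sketch only shows that a finalizer runs once the last Python reference goes away, and not before.

```python
import weakref

events = []  # records finalizer calls so the order can be inspected


class FakeProcessGroup:
    """Hypothetical stand-in for a pybind-wrapped process group."""

    def __init__(self, name):
        self.name = name
        # weakref.finalize runs the callback when the object is collected
        # (or at interpreter exit if it is still alive then).
        weakref.finalize(self, events.append, f"destroyed {name}")


class Holder:
    """Hypothetical owner, e.g. an attribute that keeps the group alive."""

    def __init__(self, pg):
        self.pg = pg


pg = FakeProcessGroup("ccl")
holder = Holder(pg)

del pg                      # one reference gone, Holder still owns the object
still_alive = list(events)  # nothing has been finalized yet

del holder                  # last reference gone, finalizer runs (CPython)
after_release = list(events)
print(still_alive, after_release)
```

If any object that survives until interpreter shutdown still holds such a reference, the finalizer is delayed until shutdown, which is where the ordering of the exit cleanup starts to matter.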
Hi @chengjunlu, thanks for your reply! Would you share the hardware and software stack you used? This issue only occurred with PT 1.10 for me -- PT 1.9 worked just fine. I was using an AWS P4d instance with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-cpu-py38-ubuntu20.04-sagemaker as the base image.
I am using the public PyTorch v1.10.0-rc3 tag for the 1.10 release.
Would you help double-check whether this issue can be reproduced without your changes?
Hi, I used the v1.10.0 tag and built PyTorch from source. And yes, even with the branch https://github.com/intel/torch-ccl/tree/ccl_torch1.10 the issue is still reproducible. I only added a std::cout in the destructor to show whether it was called.
Let's try some more experiments:
torch._C._GLIBCXX_USE_CXX11_ABI?
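For anyone repeating this check, a minimal sketch of reading that flag follows. It uses torch.compiled_with_cxx11_abi(), the public wrapper around torch._C._GLIBCXX_USE_CXX11_ABI. The helper name cxx11_abi_flag is hypothetical, and the fallback to None when torch is not installed is only there so the snippet runs anywhere.

```python
def cxx11_abi_flag():
    """Return PyTorch's CXX11 ABI flag, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    # Public wrapper around torch._C._GLIBCXX_USE_CXX11_ABI.
    return torch.compiled_with_cxx11_abi()


print("CXX11 ABI:", cxx11_abi_flag())
```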
- Do you mean the PyTorch ProcessGroup? Yes.
- It shows True.
One more question: did you try the same script I used? Yes.
Sure I will do more experiments on Monday. Do you have any insights as to what might be the issue?
It is a bizarre issue. I don't have strong confidence about the root cause. The hard part is that I cannot reproduce your issue on my platform.
Here are some points we can look into:
The process group in PT 1.10 is managed by an intrusive pointer. A known drawback of smart pointers in C++ is that a cross reference (a reference cycle) between them prevents the objects from being destroyed.
The reducer attribute of DistributedDataParallel, and the Reducer itself, keep a reference to the process group (in the test, the ProcessGroupCCL object). Another attribute, _default_pg, also keeps a reference to it.
But neither of them keeps a cross reference to the other. We need to investigate this further.
Another aspect we can check is pybind itself; that is less likely, but who knows.
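To make the cross-reference concern concrete, here is a Python analogy (not the actual intrusive pointer code, which is C++). Node is a hypothetical class, and disabling the garbage collector is an assumption used to mimic plain reference counting. The sketch shows a cycle keeping both objects alive, and a weak back-reference avoiding that.

```python
import gc
import weakref

destroyed = []  # names of objects whose finalizer has run


class Node:
    """Hypothetical object that may point at a peer."""

    def __init__(self, name):
        self.name = name
        self.peer = None
        weakref.finalize(self, destroyed.append, name)


gc.disable()  # mimic plain reference counting, as with C++ smart pointers
try:
    # Case 1: two objects referencing each other form a cycle.
    a, b = Node("a"), Node("b")
    a.peer, b.peer = b, a
    del a, b
    leaked_by_cycle = not destroyed  # reference counts never reach zero

    gc.collect()  # Python's cycle collector cleans up; C++ has no equivalent
    destroyed.clear()

    # Case 2: a weak back-reference breaks the cycle.
    c, d = Node("c"), Node("d")
    c.peer = d                # strong reference
    d.peer = weakref.ref(c)   # weak reference, does not keep c alive
    del c, d
    freed_without_gc = sorted(destroyed)
finally:
    gc.enable()

print(leaked_by_cycle, freed_without_gc)
```

In C++ the counterpart of the second case is a weak or raw back-pointer on one side of the relationship. Whether such a cycle actually exists around the process group in PT 1.10 is exactly what remains to be checked.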
Hi torch-ccl community,
I was trying to run the following code with PT 1.10 + ccl backend:
When find_unused_parameters=True, the destructor of ProcessGroupCCL was not correctly called. When find_unused_parameters=False there was no issue. This should have been fine in most cases because the destructor is empty anyway: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111. However, I am trying to build an extension which requires me to release resources in ~ProcessGroupCCL(). If ~ProcessGroupCCL() is not called, the process will hang on exit. This issue also does not exist in PT 1.9. It seems like an object life cycle management issue with PyTorch. Would appreciate any insights and help!