mpickpt / mana

MANA for MPI
35 stars 24 forks source link

REVERT potential resource leak fix (and added new comments) #363

Closed aeblyve closed 1 year ago

aeblyve commented 1 year ago

In PR #348 , I wrote code that would remove the MPI_Group group_world created in registerLocalSendsAndRecvs unconditionally.

But this is problematic because of the semantics of the ADD_NEW macro. See virtual-ids.h, lines 214-223. If the group does not yet exist, this is indeed an extra group. However, if the group already exists (i.e., the application has created the same group_world), then there is a collision, and the free can make a restart fail, for it erases a group that the application depends on, or one of the child groups depends on. See mpi-proxy-split/mpi-wrappers/mpi_group_wrappers, line 100-105. (This is in some sense fixed by virtual-ids, in that there is no more dependency between groups in this way).

This can cause an extra LOG_CALL and LOG_REPLAY to occur as written, but it is not a huge problem. We are working to remove LOG_CALL and LOG_REPLAY altogether in virtualids-v2 anyways.

For now, we should probably revert these changes for the main branch. I have done so, and also wrote some commentary to explain the situation.

jiamingz9925 commented 1 year ago

@gc00 @leonidbelyaev Hi all, do you mind waiting for me to do a thorough test on VASP before pushing the code in?

aeblyve commented 1 year ago

@jiamingz9925 I do not mind. I think that Yao said that this does not reverse the hanging issue you are also facing, though.

jiamingz9925 commented 1 year ago

@jiamingz9925 I do not mind. I think that Yao said that this does not reverse the hanging issue you are also facing, though.

I think I knew some tricks to work around that, mainly want to make sure I can restart multiple times with VASP/PdO4 case.

jiamingz9925 commented 1 year ago

I did not see any problem during 4 nodes 128 ranks run. I was able to restart multiple times