Closed aeblyve closed 1 year ago
@gc00 @leonidbelyaev Hi all, do you mind waiting for me to do a thorough test on VASP before pushing the code in?
@jiamingz9925 I do not mind. I think that Yao said that this does not reverse the hanging issue you are also facing, though.
@jiamingz9925 I do not mind. I think that Yao said that this does not reverse the hanging issue you are also facing, though.
I think I knew some tricks to work around that, mainly want to make sure I can restart multiple times with VASP/PdO4 case.
I did not see any problem during 4 nodes 128 ranks run. I was able to restart multiple times
In PR #348 , I wrote code that would remove the MPI_Group
group_world
created inregisterLocalSendsAndRecvs
unconditionally.But this is problematic because of the semantics of the ADD_NEW macro. See virtual-ids.h, lines 214-223. If the group does not yet exist, this is indeed an extra group. However, if the group already exists (i.e., the application has created the same
group_world
), then there is a collision, and the free can make a restart fail, for it erases a group that the application depends on, or one of the child groups depends on. Seempi-proxy-split/mpi-wrappers/mpi_group_wrappers
, line 100-105. (This is in some sense fixed by virtual-ids, in that there is no more dependency between groups in this way).This can cause an extra LOG_CALL and LOG_REPLAY to occur as written, but it is not a huge problem. We are working to remove LOG_CALL and LOG_REPLAY altogether in
virtualids-v2
anyways.For now, we should probably revert these changes for the main branch. I have done so, and also wrote some commentary to explain the situation.