pmodels / mpich

Official MPICH Repository
http://www.mpich.org

gforker: propagation of exit status #7115

Open dalcinl opened 2 weeks ago

dalcinl commented 2 weeks ago

A few times a month, mpi4py's testsuite running daily in GitHub Actions fails the following way:

======================================================================
FAIL: testInterrupt (__main__.TestRunScript.testInterrupt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/mpi4py-testing/mpi4py-testing/demo/test-run/test_run.py", line 173, in testInterrupt
    self.assertEqual(status, SIGINT + 128)
AssertionError: 221 != 130

Full logs here

The test code is here. What the test does is execute some Python code with np=1,2, and then one of the ranks (but not both in the np=2 case) signals itself with SIGINT. Upon an uncaught SIGINT, Python's signal/error handling prints to stderr and exits cleanly with return status = SIGINT + 128 (that is, 130 on Linux). My test checks that exit statuses are propagated properly from the MPI process through mpiexec and back to the user. Very sporadically, on GitHub Actions, that is not the case for gforker. A rough sketch of the check is shown right below.
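
Roughly, the check amounts to the following. This is only a sketch, not the actual test code: the real test makes the rank signal itself with SIGINT and relies on Python's KeyboardInterrupt handling (described later in this thread), while here the child simply exits with the expected status, and the mpiexec name and flags are assumed.

import signal
import subprocess
import sys

# Stand-in for the real child: the actual test makes the rank signal itself
# with SIGINT; here the child just exits with the status the test expects.
child_code = "import signal, sys; sys.exit(128 + signal.SIGINT)"
cmd = ["mpiexec", "-n", "1", sys.executable, "-c", child_code]  # launcher assumed

status = subprocess.run(cmd).returncode
# The exit status should propagate unchanged through mpiexec back to the user.
assert status == 128 + signal.SIGINT, f"{status} != {128 + signal.SIGINT}"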

I believe there must be either some sort of race condition, or perhaps a too-short waiting time, that originates this non-reproducible issue. I looked at gforker's sources, but I'm afraid I do not understand them well enough to guess where the problem may be. Any tips on how to further debug the issue?

dalcinl commented 2 weeks ago

Fantastic! Now we have phishing in GitHub comments? :facepalm:

hzhou commented 2 weeks ago

The code that aggregates exit code is here : https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L207-L209

If you can reproduce it locally, I would try adding a debug printf there to see if anything is unusual.

PS: after that, I would check here: https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L407-L427
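
For reference, here is an illustration of how a raw wait status decomposes through the same POSIX macros that MPIE_ProcessSetExitStatus uses. This is a Python sketch, not the mpich code, and is only meant to show the two cases handled there (normal exit vs. termination by a signal):

import os
import signal

def decode(prog_stat):
    # Mirrors the WIFEXITED/WEXITSTATUS and WIFSIGNALED/WTERMSIG logic
    # in MPIE_ProcessSetExitStatus.
    if os.WIFEXITED(prog_stat):
        print("exited, WEXITSTATUS =", os.WEXITSTATUS(prog_stat))
    if os.WIFSIGNALED(prog_stat):
        print("signaled, WTERMSIG =", os.WTERMSIG(prog_stat))

# Child exiting normally with status 42.
pid = os.fork()
if pid == 0:
    os._exit(42)
decode(os.waitpid(pid, 0)[1])     # exited, WEXITSTATUS = 42

# Child killed by SIGINT.
pid = os.fork()
if pid == 0:
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGINT)
    os._exit(1)                   # not reached
decode(os.waitpid(pid, 0)[1])     # signaled, WTERMSIG = 2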

dalcinl commented 2 weeks ago

Here is a reproducer that at least shows different behavior between hydra and gforker in how they interact with the child process in the SIGINT vs. no-signal case.

code='raise KeyboardInterrupt'
#code='raise RuntimeError'
#code='import sys; sys.exit(42)'
python -c "$code" >/dev/null 2>&1 
echo $?
mpiexec.hydra -n 1 python -c "$code" >/dev/null 2>&1 
echo $?
mpiexec.gforker -n 1 python -c "$code" >/dev/null 2>&1 
echo $?

The code raise KeyboardInterrupt behaves in a special way; see the end of this comment.

If you run the exact lines above in bash, you should get:

130
2
0

If you uncomment #code='raise RuntimeError', then you should get

1
1
1

If you uncomment #code='import sys; ...', then you should get

42
42
42

I'll detail an important Python-specific piece of information: Upon an unhandled KeyboardInterrupt exception, Python basically does the following: https://github.com/python/cpython/blob/main/Modules/main.c#L732

setsig(SIGINT, SIG_DFL);
kill(getpid(), SIGINT);
return SIGINT + 128;
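
In other words, the child actually dies from SIGINT, so the parent observes a signal termination rather than an exit code; the 130 you see from a bare shell is the shell's own 128 + signal-number convention. A small Python check of this (not part of the testsuite, and assuming a POSIX system):

import signal
import subprocess
import sys

proc = subprocess.run([sys.executable, "-c", "raise KeyboardInterrupt"],
                      capture_output=True)
# subprocess reports death-by-signal as a negative return code ...
print(proc.returncode)              # -2, i.e. -SIGINT
# ... while a shell computes $? as 128 + signal number, hence the 130 above.
print(128 + signal.SIGINT)          # 130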

hzhou commented 2 weeks ago

I see. Thanks for the simple reproducer! I can make gforker return the same as hydra. Essentially, in https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L407-L427 at line 414, we should just add rc = prog_stat;.

dalcinl commented 2 weeks ago

Indeed, the following patch makes hydra and gforker behave identically with my reproducer. I think it would be a good idea to push these changes to make the two PMs behave consistently.

diff --git a/src/pm/util/process.c b/src/pm/util/process.c
index b5b1ad2d1e..c0757144a2 100644
--- a/src/pm/util/process.c
+++ b/src/pm/util/process.c
@@ -411,6 +411,7 @@ void MPIE_ProcessSetExitStatus(ProcessState * pState, int prog_stat)
         rc = WEXITSTATUS(prog_stat);
     }
     if (WIFSIGNALED(prog_stat)) {
+        rc = prog_stat;
         sigval = WTERMSIG(prog_stat);
     }
     pState->exitStatus.exitStatus = rc;

However, after further consideration, I do not think this is the root of the problem in my CI failures. When executing mpiexec through Python's subprocess module, things seem to be handled differently; otherwise hydra and gforker should behave differently in my tests, and that is not the case (most of the time). I do not have an explanation for it yet. I'll keep investigating, starting by checking what subprocess actually reports for each launcher, as sketched below.
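
A quick way to inspect what subprocess reports for each launcher (this assumes mpiexec.hydra and mpiexec.gforker are on PATH, and is only a sketch of the kind of check I have in mind):

import subprocess
import sys

code = "raise KeyboardInterrupt"
launchers = ([], ["mpiexec.hydra", "-n", "1"], ["mpiexec.gforker", "-n", "1"])
for launcher in launchers:
    cmd = launcher + [sys.executable, "-c", code]
    proc = subprocess.run(cmd, capture_output=True)
    # Note: returncode is -N if the process died from signal N (POSIX only),
    # unlike the shell's $?, which reports 128 + N.
    print(launcher[0] if launcher else "python", "->", proc.returncode)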