Open bryang-spindance opened 2 years ago
Thank you for the clear report with samples to check this.
Unfortunately, I cannot reproduce the problem: the ChildActorExited
is delivered immediately for me when running the tests you describe.
Python versions tested: 3.7.12, 3.8.12, 3.9.6
The main difference is that I'm running a Linux system (I don't presently have access to either a Windows or a MacOS system).
07:06:42 issue85$ export PS0="running command at start time \t\n"
running command at start time 07:06:46
07:06:46 issue85$ python start.py
running command at start time 07:06:50
ActorAddr-(T|:38691) (P|:1135063) Parent received Initialize command from ActorAddr-(T|:43313)
ActorAddr-(T|:42113) (P|:1135064) Child received Initialize command from ActorAddr-(T|:38691)
07:06:50 issue85$ kill 1135064
running command at start time 07:06:58
07:06:58 issue85$ In Parent: ChildActorExited
In Parent: ChildActorExited
07:06:59 issue85$ python stop.py
running command at start time 07:07:09
07:07:09 issue85$
07:07:14 issue85$
07:07:15 issue85$ python start.py
running command at start time 07:07:21
ActorAddr-(T|:34143) (P|:1135204) Parent received Initialize command from ActorAddr-(T|:36281)
ActorAddr-(T|:32769) (P|:1135205) Child received Initialize command from ActorAddr-(T|:34143)
07:07:21 issue85$ kill -9 1135205
running command at start time 07:07:25
07:07:25 issue85$ In Parent: ChildActorExited
07:07:27 issue85$ python stop.py
running command at start time 07:07:30
07:07:30 issue85$
Can you please try two things:
-9
argument to kill
python subproc.py
and no additional actions, (b) python subproc.py
then kill
of the reported child process, and (c) python subproc.py
+ kill -9
of the reported child process. It should report "Parent got signal 17" immediately on the kill
commands.subproc.py:
from multiprocessing import *
import signal
import time
import os
def sleeper(t):
print('Process %s sleeping for %d seconds' % (os.getpid(), t))
time.sleep(t)
print('Process %s awakened and exiting' % os.getpid())
def gotsig(signum, frame):
print('Parent got signal %s' % signum)
if __name__ == "__main__":
wait_time = 10
signal.signal(signal.SIGCHLD, gotsig)
p = Process(target=sleeper, args=(wait_time,))
p.start()
time.sleep(wait_time + 2)
print('Exiting')
Repeat your tests but without the -9 argument to kill
When I do this, the process never gets killed and the actor stays alive. Both OSs exhibit the same behavior where I am prevented from killing ActorSystem child actor processes without forcefully killing them.
Note: In MacOS the signal number for SIGCHLD
is 20 instead of 17.
Run the following and let me know the results with
python subproc.py
and no additional actions
Results:
~/test$ python subproc.py
Process 42158 sleeping for 10 seconds
Process 42158 awakened and exiting
Parent got signal 20
Exiting
~/test$
Run the following and let me know the results with
python subproc.py
then kill of the reported child process
Results:
~/test$ python subproc.py
Process 53352 sleeping for 10 seconds
~/test$ kill 53352
Parent got signal 20
Exiting
~/test$
I received the exit signal immediately after killing the process.
Run the following and let me know the results with
python subproc.py
+kill -9
of the reported child process
Results:
~/test$ python subproc.py
Process 56348 sleeping for 10 seconds
~/test$ kill -9 56348
Parent got signal 20
Exiting
~/test$
I received the exit signal immediately after killing the process.
I was unable to run the subproc.py
tests on Windows since windows does not have any SIGCHLD
signal or anything that I could use as a replacement.
I should have a Linux system up and running in the next couple days so I can get back to you with my test results on that, however it will be on an ARM Cortex processor so hopefully that doesn't throw a wrench into the mix. At least for now it seems like it should work and that this is more of a Mac and Windows specific issue.
This is certainly an odd issue... It almost feels like there may be a permissions issue or something. Are there any environment variables or permissions that I should verify on my end?
Additional environment info:
On the mac, I am using PyEnv to manage my multiple python versions.
On both mac and windows I am using python virtual environments created with python -m venv
but I doubt that is the source of the issue.
Interesting... and a bit unexpected:
SIGCHLD
value is usually 17
, not 20
, although that signal is not one of the explicit values listed in the POSIX standard, so it's possible it's implementation dependent. I get <Signals.SIGCHLD 17>
when I run $ python -c 'import signal; print(repr(signal.SIGCHLD))
. On my system, 20
is SIGTSTP
which is a terminal stop signal (usually associated with Ctrl-Z
by default, although that's remappable elsewhere). What do you get for this python command?-9
should deliver SIGTERM
(usually value 15
). What happens if you explicitly run kill -15 PID
?$ type kill
command? I would expect it to report "kill is a shell builtin". Also, what shell are you running?There seem to be some sort of shenanigans going on with signals that's causing the problem. Normally, thespian will watch for a child process exit by registering for a number of signals (see https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L678-L694 and https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L29-L40). Your results indicate that signals are being delivered promptly, but not necessarily the signal that was expected. It may be possible to resolve this by simply adding the right additional signal to https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L36, but I'd like to try to understand what's happening a little better first.
This situation should not be affected by the hardware architecture or the python virtual environment.
I checked the reference man pages for OSX which provided the following:
No Name Default Action Description
1 SIGHUP terminate process terminal line hangup
2 SIGINT terminate process interrupt program
3 SIGQUIT create core image quit program
4 SIGILL create core image illegal instruction
5 SIGTRAP create core image trace trap
6 SIGABRT create core image abort program (formerly SIGIOT)
7 SIGEMT create core image emulate instruction executed
8 SIGFPE create core image floating-point exception
9 SIGKILL terminate process kill program
10 SIGBUS create core image bus error
11 SIGSEGV create core image segmentation violation
12 SIGSYS create core image non-existent system call invoked
13 SIGPIPE terminate process write on a pipe with no reader
14 SIGALRM terminate process real-time timer expired
15 SIGTERM terminate process software termination signal
16 SIGURG discard signal urgent condition present on
socket
17 SIGSTOP stop process stop (cannot be caught or
ignored)
18 SIGTSTP stop process stop signal generated from
keyboard
19 SIGCONT discard signal continue after stop
20 SIGCHLD discard signal child status has changed
This is showing SICHLD on signal number 20. Here is the output of the requested command from your first point:
~/test$ python -c 'import signal; print(repr(signal.SIGCHLD))'
<Signals.SIGCHLD: 20>
Running with kill -15 PID
gives the same results as kill PID
since the default signal for the kill command in OSX is 15 (SIGTERM
). On that note, in OSX, the kill -9 PID
command will deliver a SIGKILL
to the process.
type kill
returns kill is a shell builtin
. I'm using ZSH on my Mac but I'm just using standard Powershell on my Windows computer.
I'm looking through the code you linked and can't see why the exit signal isn't being caught since you're referencing it by name and not number.
I modified your test code in an attempt to view what signals are being received by the killed process.
from multiprocessing import *
import signal
import time
import os
def sleeper(t):
print('Process %s sleeping for %d seconds' % (os.getpid(), t))
time.sleep(t)
print('Process %s awakened and exiting' % os.getpid())
def gotsig(signum, frame):
print('Parent got signal %s' % signum)
if __name__ == '__main__':
for i in ['SIGTERM', 'SIGKILL', 'SIGQUIT', 'SIGABRT', 'SIGCHLD']:
try:
signum = getattr(signal, i)
signal.signal(signum, gotsig)
print(f'{i}: {signum}')
except (OSError, RuntimeError) as m: # os error
print(f'Skipping {i}: {m}')
wait_time = 10
p = Process(target=sleeper, args=(wait_time,))
p.start()
time.sleep(wait_time+2)
print('Exiting')
By running this code, I get the following output.
SIGTERM: 15
SIGKILL: 9
Skipping SIGKILL: [Errno 22] Invalid argument
SIGQUIT: 3
SIGABRT: 6
SIGCHLD: 20
Process 96152 sleeping for 10 seconds
Parent got signal 20
Exiting
Regardless of which signal I am killing the process with (SIGKILL: 9, SIGTERM: 15, SIGQUIT: 3, etc.) I get the same result signal in the output Parent got signal 20
. I also did a bit of digging and found that for both Linux and Mac, their man pages specify the following:
The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
This explains the invalid argument error for SIGKILL in my outputs.
Please let me know if there is any other info that may help.
Looks like only some of the signal values are defined by POSIX and the others can be remapped by different OS vendors (see https://en.wikipedia.org/wiki/Signal_(IPC) and search for "Portable number"), which is what accounts for the different numbers.
Thanks for trying the explicit -15
on the kill: I just wanted to check in case kill
was mapped to a different default signal.
SIGKILL
is a pretty emphatic termination signal and usually just used as a last resort, since the default SIGTERM
is usually nicer in that it allows the application to perform any cleanup processing it might have setup via an associated signal handler. Regardless, however the child exits, the parent should always get the SIGCHLD
, which is what the latest script seems to verify.
I appreciate your patience in working through this with me. When you have a chance, the next set of tests is:
thespian/system/transport/TCPTransport.py
with https://gist.github.com/kquick/371519aa6cb0bea16961cc100eb7cf3f THESPLOG_THRESHOLD=DEBUG THESPLOG_MAX_FILESIZE=1024000 THESPLOG_FILE=$(pwd)/run1.log python start.py
and then complete your original test process.run1.log
to me (via gist or attachment here, after you've perused and tweaked them to remove any sensitive data you don't want exposed).That will give me some internal Thespian log information that should hopefully reveal what's different about running in your environment.
I've attached a few different logs here. All were generated by running the test code originally linked at the beginning of this issue.
kill_minus_15_SIGTERM.log
When running to generate kill_minus_15_SIGTERM.log
, I attempted to kill the child process using kill -15 PID
, which doesn't actually kill the process at all. After running the kill command, I ran stop.py
to gracefully shut down the actorsystem.
kill_minus_9_SIGKILL.log
When running to generate kill_minus_9_SIGKILL.log
, I used kill -9 PID
to kill the child process. The child process was killed successfully but as expected didn't report back to the parent with a ChildActorExited message. I then gracefully shut down the actorsystem using stop.py
.
I'm noticing in the kill_minus_15_SIGTERM.log
file, on line 38, we get
2022-02-11 07:34:12.256399 p27057 Warn Actor app.child.Child @ ActorAddr-(T|:61345) got shutdown signal: 15
whereas no such line exists in the kill_minus_9_SIGKILL.log
file.
I also tested using kill -3
and kill -6
, both of which exhibited the exact same behavior as kill -15
. They didn't result in the process being killed and the actor remained alive. They both did however show up in the log file saying got shutdown signal: 3
and got shutdown signal: 6
kill_minus_17_SIGSTOP.log
I then tested using kill -17
which is the SIGSTOP
signal on MacOSX, which just like kill -9
(SIGKILL
), "cannot be caught or ignored" according to the MacOSX and Linux man pages. This command behaved slightly differently from kill -9
. After stopping (not killing since the process never actually died) the child actor process with SIGSTOP
, then attempting to run stop.py
to gracefully shut down the actorsystem, the shutdown process took longer than expected and resulted in the following error at the end of the log:
2022-02-11 07:55:48.015610 p46550 ERR No response to Admin shutdown request; Actor system not completely shutdown
I think this is due to SIGSTOP
essentially freezing a process which thespian recognizes as still being an existing process. So when thespian tries to kill the frozen process it gets stuck trying to kill the process but can't actually kill it and eventually gives up.
Quick update. On my Debian system I receive the ChildActorExited message as anticipated.
Thanks for the info and update. The log files are not showing some events I'm expecting to see. I'm looking into getting access to a MacOS environment to see if I can do some local testing/reproduction.
If there's something in particular you're looking for, I could do a bit of digging to get some more detailed info. Thanks again for your help. :)
Code
Below is example code which has reproduced this issue for me.
Project structure
start.py
stop.py
parent.py
child.py
Procedure
1) Navigate to the root of my project.
2) Start the actor system using
start.py
3) Kill the child actor process by its PID.
4) After tests, shut down actor system.
Problem
From the procedure above, upon killing the child process in step 3, it is to my understanding that the parent actor should immediately receive a
ChildActorExited
message and my example program should print outChildActorExited
to the terminal; this, however, does not happen. Instead the remaining parent actor will stay alive and report nothing.I have tested this same functionality on MacOS and Windows with the same results. I also tried using
multiprocUDPBase
but again got the same results.Another thing to note is after killing the child process and running
stop.py
, the actor system takes a bit longer than usual to shutdown however it does not print any additional information.Environment
thespian
.If you need any additional information, please let me know.