kquick / Thespian

Python Actor concurrency library
MIT License
187 stars 24 forks source link

ChildActorExited - Manually killing child PID does not result in ChildActorExited Message to Parent #85

Open bryang-spindance opened 2 years ago

bryang-spindance commented 2 years ago

Code

Below is example code which has reproduced this issue for me.

Project structure

.
├── app
│   ├── __init__.py
│   ├── parent.py
│   └── child.py
├── start.py
└── stop.py

start.py

from thespian.actors import ActorSystem
from app import parent

if __name__ == '__main__':
  asys = ActorSystem('multiprocTCPBase')
  actor = asys.createActor(parent.Parent)
  asys.tell(actor, parent.Initialize())

stop.py

from thespian.actors import ActorSystem

if __name__ == '__main__':
  ActorSystem('multiprocTCPBase').shutdown()

parent.py

from thespian.actors import ActorTypeDispatcher
import os

class Initialize:
  def __init__(self):
    pass

class Parent(ActorTypeDispatcher):
  def __init__(self, *args, **kwargs):
    super(Parent, self).__init__(*args, **kwargs)
    self._child = None

  def receiveMsg_Initialize(self, msg, sender):
    print(f'{self.myAddress} (P|:{os.getpid()}) Parent received Initialize command from {sender}')
    self._child = self.createActor('app.child.Child')
    self.send(self._child, Initialize())

  def receiveMsg_ChildActorExited(self, msg, sender):
    print(f'In Parent: ChildActorExited')

child.py

from thespian.actors import ActorTypeDispatcher
import os

class Child(ActorTypeDispatcher):
  def __init__(self, *args, **kwargs):
    super(Child, self).__init__(*args, **kwargs)

  def receiveMsg_Initialize(self, msg, sender):
    print(f'{self.myAddress} (P|:{os.getpid()}) Child received Initialize command from {sender}')

Procedure

1) Navigate to the root of my project.

~$ cd ~/Documents/thespian/test

2) Start the actor system using start.py

~/Documents/thespian/test$ python start.py

ActorAddr-(T|:1111) (P|:5555) Parent received Initialize command from ActorAddr-(T|:0000)
ActorAddr-(T|:2222) (P|:6666) Child received Initialize command from ActorAddr-(T|:1111)

3) Kill the child actor process by its PID.

~/Documents/thespian/test$ kill -9 6666

4) After tests, shut down actor system.

~/Documents/thespian/test$ python stop.py

Problem

From the procedure above, upon killing the child process in step 3, it is to my understanding that the parent actor should immediately receive a ChildActorExited message and my example program should print out ChildActorExited to the terminal; this, however, does not happen. Instead the remaining parent actor will stay alive and report nothing.

I have tested this same functionality on MacOS and Windows with the same results. I also tried using multiprocUDPBase but again got the same results.

Another thing to note is after killing the child process and running stop.py, the actor system takes a bit longer than usual to shutdown however it does not print any additional information.

Environment


If you need any additional information, please let me know.

kquick commented 2 years ago

Thank you for the clear report with samples to check this.

Unfortunately, I cannot reproduce the problem: the ChildActorExited is delivered immediately for me when running the tests you describe.

Python versions tested: 3.7.12, 3.8.12, 3.9.6

The main difference is that I'm running a Linux system (I don't presently have access to either a Windows or a MacOS system).

07:06:42 issue85$ export PS0="running command at start time \t\n"
running command at start time 07:06:46
07:06:46 issue85$ python start.py
running command at start time 07:06:50
ActorAddr-(T|:38691) (P|:1135063) Parent received Initialize command from ActorAddr-(T|:43313)
ActorAddr-(T|:42113) (P|:1135064) Child received Initialize command from ActorAddr-(T|:38691)
07:06:50 issue85$ kill 1135064
running command at start time 07:06:58
07:06:58 issue85$ In Parent: ChildActorExited
In Parent: ChildActorExited

07:06:59 issue85$ python stop.py
running command at start time 07:07:09
07:07:09 issue85$ 
07:07:14 issue85$ 
07:07:15 issue85$ python start.py
running command at start time 07:07:21
ActorAddr-(T|:34143) (P|:1135204) Parent received Initialize command from ActorAddr-(T|:36281)
ActorAddr-(T|:32769) (P|:1135205) Child received Initialize command from ActorAddr-(T|:34143)
07:07:21 issue85$ kill -9 1135205
running command at start time 07:07:25
07:07:25 issue85$ In Parent: ChildActorExited

07:07:27 issue85$ python stop.py
running command at start time 07:07:30
07:07:30 issue85$ 

Can you please try two things:

  1. Repeat your tests but without the -9 argument to kill
  2. Run the following and let me know the results with (a) python subproc.py and no additional actions, (b) python subproc.py then kill of the reported child process, and (c) python subproc.py + kill -9 of the reported child process. It should report "Parent got signal 17" immediately on the kill commands.

subproc.py:

from multiprocessing import *
import signal
import time
import os

def sleeper(t):
    print('Process %s sleeping for %d seconds' % (os.getpid(), t))
    time.sleep(t)
    print('Process %s awakened and exiting' % os.getpid())

def gotsig(signum, frame):
    print('Parent got signal %s' % signum)

if __name__ == "__main__":
    wait_time = 10
    signal.signal(signal.SIGCHLD, gotsig)
    p = Process(target=sleeper, args=(wait_time,))
    p.start()
    time.sleep(wait_time + 2)
    print('Exiting')
bryang-spindance commented 2 years ago

Repeat your tests but without the -9 argument to kill

When I do this, the process never gets killed and the actor stays alive. Both OSs exhibit the same behavior where I am prevented from killing ActorSystem child actor processes without forcefully killing them.


Note: In MacOS the signal number for SIGCHLD is 20 instead of 17.

Run the following and let me know the results with python subproc.py and no additional actions

Results:

~/test$ python subproc.py
    Process 42158 sleeping for 10 seconds
    Process 42158 awakened and exiting
    Parent got signal 20
    Exiting
~/test$

Run the following and let me know the results with python subproc.py then kill of the reported child process

Results:

~/test$ python subproc.py
    Process 53352 sleeping for 10 seconds
~/test$ kill 53352
    Parent got signal 20
    Exiting
~/test$

I received the exit signal immediately after killing the process.


Run the following and let me know the results with python subproc.py + kill -9 of the reported child process

Results:

~/test$ python subproc.py
    Process 56348 sleeping for 10 seconds
~/test$ kill -9 56348
    Parent got signal 20
    Exiting
~/test$

I received the exit signal immediately after killing the process.


I was unable to run the subproc.py tests on Windows since windows does not have any SIGCHLD signal or anything that I could use as a replacement.

I should have a Linux system up and running in the next couple days so I can get back to you with my test results on that, however it will be on an ARM Cortex processor so hopefully that doesn't throw a wrench into the mix. At least for now it seems like it should work and that this is more of a Mac and Windows specific issue.

This is certainly an odd issue... It almost feels like there may be a permissions issue or something. Are there any environment variables or permissions that I should verify on my end?

Additional environment info: On the mac, I am using PyEnv to manage my multiple python versions. On both mac and windows I am using python virtual environments created with python -m venv but I doubt that is the source of the issue.

kquick commented 2 years ago

Interesting... and a bit unexpected:

  1. The SIGCHLD value is usually 17, not 20, although that signal is not one of the explicit values listed in the POSIX standard, so it's possible it's implementation dependent. I get <Signals.SIGCHLD 17> when I run $ python -c 'import signal; print(repr(signal.SIGCHLD)). On my system, 20 is SIGTSTP which is a terminal stop signal (usually associated with Ctrl-Z by default, although that's remappable elsewhere). What do you get for this python command?
  2. Running kill without the -9 should deliver SIGTERM (usually value 15). What happens if you explicitly run kill -15 PID?
  3. Can you run $ type kill command? I would expect it to report "kill is a shell builtin". Also, what shell are you running?

There seem to be some sort of shenanigans going on with signals that's causing the problem. Normally, thespian will watch for a child process exit by registering for a number of signals (see https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L678-L694 and https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L29-L40). Your results indicate that signals are being delivered promptly, but not necessarily the signal that was expected. It may be possible to resolve this by simply adding the right additional signal to https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L36, but I'd like to try to understand what's happening a little better first.

This situation should not be affected by the hardware architecture or the python virtual environment.

bryang-spindance commented 2 years ago
  1. I checked the reference man pages for OSX which provided the following:

     No    Name         Default Action       Description
     1     SIGHUP       terminate process    terminal line hangup
     2     SIGINT       terminate process    interrupt program
     3     SIGQUIT      create core image    quit program
     4     SIGILL       create core image    illegal instruction
     5     SIGTRAP      create core image    trace trap
     6     SIGABRT      create core image    abort program (formerly SIGIOT)
     7     SIGEMT       create core image    emulate instruction executed
     8     SIGFPE       create core image    floating-point exception
     9     SIGKILL      terminate process    kill program
     10    SIGBUS       create core image    bus error
     11    SIGSEGV      create core image    segmentation violation
     12    SIGSYS       create core image    non-existent system call invoked
     13    SIGPIPE      terminate process    write on a pipe with no reader
     14    SIGALRM      terminate process    real-time timer expired
     15    SIGTERM      terminate process    software termination signal
     16    SIGURG       discard signal       urgent condition present on
                                             socket
     17    SIGSTOP      stop process         stop (cannot be caught or
                                             ignored)
     18    SIGTSTP      stop process         stop signal generated from
                                             keyboard
     19    SIGCONT      discard signal       continue after stop
     20    SIGCHLD      discard signal       child status has changed

    This is showing SICHLD on signal number 20. Here is the output of the requested command from your first point:

    ~/test$ python -c 'import signal; print(repr(signal.SIGCHLD))'
    <Signals.SIGCHLD: 20>
  2. Running with kill -15 PID gives the same results as kill PID since the default signal for the kill command in OSX is 15 (SIGTERM). On that note, in OSX, the kill -9 PID command will deliver a SIGKILL to the process.

  3. type kill returns kill is a shell builtin. I'm using ZSH on my Mac but I'm just using standard Powershell on my Windows computer.

I'm looking through the code you linked and can't see why the exit signal isn't being caught since you're referencing it by name and not number.

I modified your test code in an attempt to view what signals are being received by the killed process.

from multiprocessing import *
import signal
import time
import os

def sleeper(t):
    print('Process %s sleeping for %d seconds' % (os.getpid(), t))
    time.sleep(t)
    print('Process %s awakened and exiting' % os.getpid())

def gotsig(signum, frame):
    print('Parent got signal %s' % signum)

if __name__ == '__main__':
    for i in ['SIGTERM', 'SIGKILL', 'SIGQUIT', 'SIGABRT', 'SIGCHLD']:
        try:
            signum = getattr(signal, i)
            signal.signal(signum, gotsig)
            print(f'{i}: {signum}')
        except (OSError, RuntimeError) as m: # os error
            print(f'Skipping {i}: {m}')

    wait_time = 10
    p = Process(target=sleeper, args=(wait_time,))
    p.start()
    time.sleep(wait_time+2)
    print('Exiting')

By running this code, I get the following output.

SIGTERM: 15
SIGKILL: 9
Skipping SIGKILL: [Errno 22] Invalid argument
SIGQUIT: 3
SIGABRT: 6
SIGCHLD: 20
Process 96152 sleeping for 10 seconds
Parent got signal 20
Exiting

Regardless of which signal I am killing the process with (SIGKILL: 9, SIGTERM: 15, SIGQUIT: 3, etc.) I get the same result signal in the output Parent got signal 20. I also did a bit of digging and found that for both Linux and Mac, their man pages specify the following:

The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.

This explains the invalid argument error for SIGKILL in my outputs.

Please let me know if there is any other info that may help.

kquick commented 2 years ago

Looks like only some of the signal values are defined by POSIX and the others can be remapped by different OS vendors (see https://en.wikipedia.org/wiki/Signal_(IPC) and search for "Portable number"), which is what accounts for the different numbers.

Thanks for trying the explicit -15 on the kill: I just wanted to check in case kill was mapped to a different default signal.

SIGKILL is a pretty emphatic termination signal and usually just used as a last resort, since the default SIGTERM is usually nicer in that it allows the application to perform any cleanup processing it might have setup via an associated signal handler. Regardless, however the child exits, the parent should always get the SIGCHLD, which is what the latest script seems to verify.

I appreciate your patience in working through this with me. When you have a chance, the next set of tests is:

  1. Update your local thespian's thespian/system/transport/TCPTransport.py with https://gist.github.com/kquick/371519aa6cb0bea16961cc100eb7cf3f
  2. THESPLOG_THRESHOLD=DEBUG THESPLOG_MAX_FILESIZE=1024000 THESPLOG_FILE=$(pwd)/run1.log python start.py and then complete your original test process.
  3. Send both run1.log to me (via gist or attachment here, after you've perused and tweaked them to remove any sensitive data you don't want exposed).

That will give me some internal Thespian log information that should hopefully reveal what's different about running in your environment.

bryang-spindance commented 2 years ago

I've attached a few different logs here. All were generated by running the test code originally linked at the beginning of this issue.

kill_minus_15_SIGTERM.log When running to generate kill_minus_15_SIGTERM.log, I attempted to kill the child process using kill -15 PID, which doesn't actually kill the process at all. After running the kill command, I ran stop.py to gracefully shut down the actorsystem.

kill_minus_9_SIGKILL.log When running to generate kill_minus_9_SIGKILL.log, I used kill -9 PID to kill the child process. The child process was killed successfully but as expected didn't report back to the parent with a ChildActorExited message. I then gracefully shut down the actorsystem using stop.py.

I'm noticing in the kill_minus_15_SIGTERM.log file, on line 38, we get

2022-02-11 07:34:12.256399 p27057 Warn Actor app.child.Child @ ActorAddr-(T|:61345) got shutdown signal: 15

whereas no such line exists in the kill_minus_9_SIGKILL.log file.

I also tested using kill -3 and kill -6, both of which exhibited the exact same behavior as kill -15. They didn't result in the process being killed and the actor remained alive. They both did however show up in the log file saying got shutdown signal: 3 and got shutdown signal: 6

kill_minus_17_SIGSTOP.log I then tested using kill -17 which is the SIGSTOP signal on MacOSX, which just like kill -9 (SIGKILL), "cannot be caught or ignored" according to the MacOSX and Linux man pages. This command behaved slightly differently from kill -9. After stopping (not killing since the process never actually died) the child actor process with SIGSTOP, then attempting to run stop.py to gracefully shut down the actorsystem, the shutdown process took longer than expected and resulted in the following error at the end of the log:

2022-02-11 07:55:48.015610 p46550 ERR  No response to Admin shutdown request; Actor system not completely shutdown

I think this is due to SIGSTOP essentially freezing a process which thespian recognizes as still being an existing process. So when thespian tries to kill the frozen process it gets stuck trying to kill the process but can't actually kill it and eventually gives up.

bryang-spindance commented 2 years ago

Quick update. On my Debian system I receive the ChildActorExited message as anticipated.

kquick commented 2 years ago

Thanks for the info and update. The log files are not showing some events I'm expecting to see. I'm looking into getting access to a MacOS environment to see if I can do some local testing/reproduction.

bryang-spindance commented 2 years ago

If there's something in particular you're looking for, I could do a bit of digging to get some more detailed info. Thanks again for your help. :)