Actor system deadlock using multiprocQueueBase

lns-ross commented 3 years ago

First, many thanks for this package.

During our Due Diligence stress testing we encountered an issue that only appears when using the multiprocQueueBase.

We have a simple tester script (attached) that we are using to check throughput and limits of the various system bases along with troupes and application messaging modes. All it does is send messages (ints) to an actor that echoes them back to the caller. The caller then verifies that each response is an expected value. There are two main modes: sync (send one, check one) and async (send all and then check responses; or send in chunks and check). It is this async 'send all' mode that seems to fail, and only in the Queue base (see below).
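For readers without the attachment, the three sending patterns can be sketched roughly as follows. This is not the attached sender.py and it does not use Thespian: it is a minimal stand-in built on standard-library threads and queues, and the names `echo_worker`, `run`, `mode`, and `chunk` are hypothetical. It only illustrates how "send one, check one", "send all, then check", and "send in chunks" differ.

```python
import queue
import threading


def echo_worker(inbox, outbox):
    """Stand-in for the echo actor: return every message until None arrives."""
    while True:
        msg = inbox.get()
        if msg is None:
            break
        outbox.put(msg)


def run(num_msgs, mode="async", chunk=10):
    """Send num_msgs ints to the echo worker and verify each response.

    mode: "sync"  -> send one, check one
          "chunk" -> send `chunk` messages, check them, repeat
          "async" -> send everything, then check everything
    """
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=echo_worker, args=(inbox, outbox), daemon=True)
    worker.start()

    # The three modes differ only in how many messages are in flight at once.
    if mode == "sync":
        step = 1
    elif mode == "chunk":
        step = chunk
    else:
        step = max(num_msgs, 1)

    verified = 0
    for start in range(0, num_msgs, step):
        batch = range(start, min(start + step, num_msgs))
        for value in batch:
            inbox.put(value)
        for expected in batch:
            # A single echo worker preserves order; a troupe would not guarantee this.
            got = outbox.get(timeout=5)
            if got != expected:
                raise ValueError(f"expected {expected}, got {got}")
            verified += 1

    inbox.put(None)  # tell the worker to stop
    worker.join(timeout=5)
    return verified


if __name__ == "__main__":
    for mode in ("sync", "chunk", "async"):
        print(mode, run(1000, mode))
```

In the 'send all' case every message is queued before any response is read, which is the pattern that puts the most pressure on whatever transport sits between sender and echoer.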

Script help follows:

Usage: sender.py [-h] [-s] [-c] [-t] [{-Q|-T|-U}] [<num_msgs>]

where:
  -s        Send in sync mode.  Default: async
  -c        Send in chunks for async mode
  -t        Use a troupe of echoers
  -Q        Use 'multiprocQueueBase'
  -T        Use 'multiprocTCPBase'
  -U        Use 'multiprocUDPBase'
  -h        Prints this help message and exits

  <num_msgs> - the number of messages to send for echo. Default: 1000

We have validated that this issue exists in python v3.8.8 and v3.9.2 on both OSX and Debian buster (our available operational envs).

The command runs just fine when it is run in any mode other than the following two basic commands:

$ ./sender.py -Q
or:
$ ./sender.py -t -Q

When one of those commands is run as-is (or with any <num_msgs> value above roughly 55), the actor system seems to deadlock, so much so that when you hit ^C to kill it, the shutdown() attempt also fails to complete. In fact, I have to kill -9 the actor processes.

There doesn't seem to be any activity I can detect, so I am fairly certain this isn't an application level issue. Unless I'm missing something about the Queue system base that may be causing this.

Thespian Version: 3.10.4

TIA

lns-ross commented 3 years ago

sender.zip

kquick commented 3 years ago

Thanks for posting this. Apologies for the delay: I was out on vacation.

I do not recommend using the multiprocQueue base for high-load systems for a number of reasons, including the issues you ran into above. I am hesitant to blame the Queue module in the standard Python library, but in my local testing I have seen similar deadlocks and hangs and dropped messages from the Queue and have not identified anything in the Thespian code that seems like it could be causing those.

Another reason I don't recommend the Queue base is that a Python Queue can only be established between a parent process and a child process (the parent and child actors in this case), so any delivery of messages outside this range has to be forwarded up the actor "tree". For example:

   Actor A
      creates Actor B
         creates Actor C
            creates Actor E
         creates Actor D
            creates Actor F

Assuming for the moment that all Actors are aware of every other actor address, if F wants to send a message to E, the following occurs:

1. F checks if address E corresponds to a Queue F has. It does not, so:
2. F sends the message to its parent (actor D).
3. D gets the message from F, sees that the target is E, does the same check as step 1 with the same failure, so:
4. D sends the message to its parent (actor B).
5. B does the same check as D, and while it does not have a direct queue to E, it does have information (passed upward as a result of the createActor C did for E) that C could forward the message, so it sends the message to C.
6. C receives the message, does the check as in step 1 and finds that it does have a Queue to E, so it sends the message via that Queue.
7. E receives the message.
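The hop sequence above can be reproduced with a small toy model. This is only a sketch of the forwarding rule as described here, not Thespian's actual routing code; the `PARENT` table and the helpers `children`, `descendants`, and `route` are hypothetical names introduced for illustration. The assumed rule is: if the target is a known descendant, send to the child that leads to it, otherwise send to the parent.

```python
# Toy model of parent/child queue forwarding; not Thespian's implementation.
# Each entry maps an actor to the parent that created it (A is the root).
PARENT = {"B": "A", "C": "B", "D": "B", "E": "C", "F": "D"}


def children(actor):
    """Actors created directly by `actor` (it has a queue to each of them)."""
    return [child for child, parent in PARENT.items() if parent == actor]


def descendants(actor):
    """Map every actor below `actor` to the direct child that leads to it."""
    via = {}
    for child in children(actor):
        via[child] = child
        for below in descendants(child):
            via[below] = child
    return via


def route(sender, target):
    """Return the list of actors a message visits going from sender to target."""
    path = [sender]
    current = sender
    while current != target:
        via = descendants(current)
        if target in via:
            current = via[target]      # direct child, or the child that can forward
        elif current in PARENT:
            current = PARENT[current]  # no known route downward: hand to the parent
        else:
            raise LookupError(f"{target} is not reachable from {sender}")
        path.append(current)
    return path


if __name__ == "__main__":
    print(" -> ".join(route("F", "E")))  # F -> D -> B -> C -> E
```

In this model a message from F to E passes through three intermediate actors, each of which has to receive and re-send it.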

The above is clearly not very efficient (as noted in https://thespianpy.com/doc/using.html#hH-59ea5ca7-a167-4d51-bd1b-78119eab6df6); the multiprocTCPBase or multiprocUDPBase would not use this type of message forwarding and would send directly between E and F, which is why they are preferable to the multiprocQueueBase.

Also note that the multiprocUDPBase does not inherently provide message delivery guarantees, so it is possible to drop messages (as you can observe with your chunk-based sending). The multiprocTCPBase is the slowest but also the most reliable and flexible.

lns-ross commented 3 years ago

Really helpful clarification. We were most concerned that it only happened in one mode on one system base. If it's a known and non-recommended scenario then we will stop worrying about it. Closing.

Thx again.

kquick / Thespian

Actor system deadlock using multiprocQueueBase #76