Closed by andatt 5 years ago
Hi @andatt
It looks like there are a couple of things going on with your code:
The error code 3585 indicates that the Actor System is shutting down, which is why the pending actor-create requests are being errored. Although you haven't shown it above, I assume there is an `ActorSystem().shutdown()` being called somewhere.
The PrimaryActor will emit 11 sends to SecondaryActor, which has a troupe size of 200, so each of the 11 messages will probably result in a new SecondaryActor. Each of those SecondaryActor instances will create another troupe (TertiaryActor) and send 10 messages, resulting in ~10 TertiaryActor instances for each SecondaryActor. Each TertiaryActor will forward to a QuaternaryActor; the QuaternaryActor is a single troupe for the entire ActorSystem, but since its maximum is 200 and it will receive about 100 messages, each QuaternaryActor instance will handle a single message. Since the troupe count allowance for all of these is 200, and each troupe member only receives ~10 messages (except for QuaternaryActor, which receives about 100), there is no re-use of troupe members.
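The fan-out arithmetic above can be sketched in plain Python. The 11 and 10 fan-out factors come from the original test script (not shown in this thread), so treat them as assumptions:

```python
# Back-of-the-envelope message counts for the actor fan-out described above.
# The specific fan-out factors (11 and 10) are assumptions taken from the
# discussion, not from the (unposted) test script itself.
primary_sends = 11           # Primary -> Secondary messages
per_secondary = 10           # each Secondary fans out to Tertiary
tertiary_msgs = primary_sends * per_secondary
print(tertiary_msgs)         # -> 110 messages reaching Tertiary/Quaternary
```

This matches the "expected 110 messages" figure mentioned later in the thread.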
The reason you are not seeing the expected number of "Received message number" log entries is that the `self.createActor()` operation is asynchronous, as is `self.send()`, and nearly all of the troupe members will exit when their `receiveMessage()` exits, which automatically issues an exit to their children. Thus, you are requesting the creation of about 220 local processes, with a message sent from one to the next when the creation completes, but the sending actors are exiting almost immediately without confirming the send operation, and then the entire ActorSystem is shut down before most of this can occur.
Suggestions for making this behave as you expect:
Confirm that a child actor has finished performing all requested activity before the parent exits. This is usually done by having the child send a "completed" or "idle" message back to the parent.
If a troupe member will create another actor to perform work, then it should set `self.troupe_work_in_progress = True` when it sends the work to the child actor, and clear it when it receives the "completed" or "idle" message back from that child. This marks the troupe member as "busy", so it will not be killed by the troupe leader until it becomes idle.
You don't need to set the `troupe_work_in_progress` member to `True` for the TertiaryActor, because QuaternaryActor has a `globalName` and therefore there is only one system-wide instance; its parent is the ActorSystemManager, not the TertiaryActor, so shutting down the TertiaryActor will not kill the QuaternaryActor (although the TertiaryActor may not complete the send of the message to QuaternaryActor before it exits).
Similar to the completion or idleness confirmation from each actor to its parent, you probably want a similar indicator before issuing the `ActorSystem().shutdown()` operation, to ensure that the actor system has completed all the requested work before being shut down.
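The parent-side bookkeeping behind this handshake can be illustrated independently of Thespian. The sketch below models only the counting logic; `busy` stands in for `troupe_work_in_progress`, and the class and method names are illustrative, not part of the Thespian API:

```python
# Sketch of the completion-handshake bookkeeping described above: the parent
# counts work items sent to children and only reports idle once every child
# has acknowledged. Plain Python, independent of Thespian; "busy" plays the
# role of troupe_work_in_progress.

class CompletionTracker:
    """Tracks outstanding child work for a parent actor (illustrative)."""
    def __init__(self):
        self.child_count = 0        # work items sent to children
        self.children_finished = 0  # "completed"/"idle" replies received

    def work_sent(self):
        self.child_count += 1

    def child_done(self):
        self.children_finished += 1

    @property
    def busy(self):
        # Remains True until every child has confirmed completion.
        return self.children_finished < self.child_count

tracker = CompletionTracker()
for _ in range(10):              # parent sends 10 work items
    tracker.work_sent()
assert tracker.busy              # still waiting on acknowledgements
for _ in range(10):              # each child replies "completed"
    tracker.child_done()
print(tracker.busy)              # -> False: safe to be reaped / shut down
```

The same invariant (all acknowledgements received) is what you would check before calling `ActorSystem().shutdown()`.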
Let me know if this helps or if there are still unexpected behaviors, Kevin
Hi Kevin
That makes sense re the actor system shutdown - there was an erroneous uncommented call which I have now removed.
I modified the secondary / tertiary actors as follows:
```python
@troupe(max_count=200, idle_count=1)
class SecondaryActor(ActorTypeDispatcher):
    child_count = 0
    children_finished = 0

    def receiveMessage(self, msg, sender):
        self.troupe_work_in_progress = True
        if isinstance(msg, list):
            if not hasattr(self, "helper"):
                self.helper = self.createActor(TertiaryActor)
            for data in msg:
                self.child_count += 1
                self.send(self.helper, {"from": self.myAddress, "data": data})

    def receiveMsg_str(self, msg, sender):
        self.children_finished += 1
        if self.children_finished == self.child_count:
            self.troupe_work_in_progress = False


@troupe(max_count=200, idle_count=1)
class TertiaryActor(ActorTypeDispatcher):
    def receiveMessage(self, msg, sender):
        if isinstance(msg, dict):
            qa = self.createActor(QuaternaryActor, globalName="quaternay")
            self.send(qa, msg["data"])
            self.send(msg["from"], "done!")
```
Running now produces the output in the logfile:
```
INFO ActorAddr-(T|:35549) => Received message number 6
INFO ActorAddr-(T|:44669) => Received message number 11
INFO ActorAddr-(T|:33267) => Received message number 1
INFO ActorAddr-(T|:39331) => Received message number 7
INFO ActorAddr-(T|:38979) => Received message number 12
INFO ActorAddr-(T|:46669) => Received message number 5
INFO ActorAddr-(T|:33061) => Received message number 3
INFO ActorAddr-(T|:38861) => Received message number 13
INFO ActorAddr-(T|:38275) => Received message number 9
INFO ActorAddr-(T|:37443) => Received message number 14
INFO ActorAddr-(T|:38881) => Received message number 16
INFO ActorAddr-(T|:36561) => Received message number 8
INFO ActorAddr-(T|:41383) => Received message number 17
INFO ActorAddr-(T|:38431) => Received message number 10
INFO ActorAddr-(T|:39275) => Received message number 2
INFO ActorAddr-(T|:35219) => Received message number 4
INFO ActorAddr-(T|:33423) => Received message number 1
INFO ActorAddr-(T|:45961) => Received message number 11
INFO ActorAddr-(T|:45961) => Received message number 15
INFO ActorAddr-(T|:45961) => Received message number 5
INFO ActorAddr-(T|:45961) => Received message number 10
INFO ActorAddr-(T|:45961) => Received message number 20
INFO ActorAddr-(T|:40683) => Received message number 25
INFO ActorAddr-(T|:40683) => Received message number 5
INFO ActorAddr-(T|:40683) => Received message number 20
INFO ActorAddr-(T|:40683) => Received message number 25
INFO ActorAddr-(T|:40683) => Received message number 15
INFO ActorAddr-(T|:40683) => Received message number 10
INFO ActorAddr-(T|:40683) => Received message number 5
```
So it still seems not to be outputting all the expected messages. I can confirm that all the messages reach and are processed by the TertiaryActor, but they don't seem to reach the QuaternaryActor. Nor do the str messages from the tertiary seem to be received by the secondary.
What am I missing here?
Thanks
Andrew
Hi @andatt,
The sample code you posted doesn't match the behavior you describe. I can make fixes to the sample code so that it works the way I think you intended, but the end result of that works fully as expected, so ultimately I can't be sure exactly where your error is.
Actually, it resulted in too many messages being delivered, which revealed a sequencing bug in the troupe implementation. I've pushed a fix for that bug, but that wasn't the effect you were seeing.
Some of the changes I made to your posted code to compare to your runs:
- An `ActorTypeDispatcher` base class should implement `receiveMsg_TYPE` methods, but should not implement a `receiveMessage` method (see your `SecondaryActor` and `TertiaryActor` above). In addition, when using the `ActorTypeDispatcher`, it is no longer necessary to call `isinstance(msg, TYPE)` because the dispatcher only routes messages of the specified type to that method.
- An `Actor` base class should ensure that the message is of the right type before performing work. Notably, your `PrimaryActor` would always try to run the test whenever it received any message. When the ActorSystem is shutdown, all actors receive an `ActorExitRequest`, which caused the PrimaryActor to attempt to send test data again. When the first SecondaryActor troupe exited, the PrimaryActor would receive a `ChildActorExited`, which would cause it to try to run the test yet again. This added messages to dead or dying actors which then needed to timeout and be processed, which interferes with the shutdown.

I created a gist (https://gist.github.com/kquick/38a23b58e16f1505720a4a18fece012f) with the updated version of your code. When I run this (using the latest master, for the too-many-messages troupe fix above) I get the expected 110 messages in bug_check_2.log.
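The type-routing rule described above can be mimicked in plain Python to show why the `isinstance()` checks were redundant. This is an illustrative stand-in, not Thespian's actual `ActorTypeDispatcher` implementation, and the class names here are hypothetical:

```python
# Minimal illustration of receiveMsg_TYPE dispatch: a handler named after a
# message's type only ever sees messages of that type, so isinstance()
# checks inside the handler are redundant. This hand-rolled dispatcher only
# mimics the idea behind Thespian's ActorTypeDispatcher.

class TypeDispatcher:
    def receive(self, msg):
        # Route to receiveMsg_<TypeName> based on the message's type.
        handler = getattr(self, "receiveMsg_" + type(msg).__name__, None)
        if handler is None:
            return "unhandled"   # no handler registered for this type
        return handler(msg)

class Secondary(TypeDispatcher):
    def receiveMsg_list(self, msg):
        return f"got {len(msg)} work items"   # guaranteed to be a list

    def receiveMsg_str(self, msg):
        return f"child says: {msg}"           # guaranteed to be a str

s = Secondary()
print(s.receive([1, 2, 3]))   # -> got 3 work items
print(s.receive("done!"))     # -> child says: done!
print(s.receive(42))          # -> unhandled (no receiveMsg_int defined)
```

The same separation also makes it easy to ignore system messages such as `ActorExitRequest` or `ChildActorExited`: simply don't define handlers for them.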
Well this is weird. When I copy paste your code and run it I get exactly the same behaviour I was experiencing with my original code i.e. very few messages in bug_check_2.log.
Just in case, I changed the troupe decorators to just `@troupe()` and made the sleep time at the end 20 seconds. Still the same result. I also tried setting the env var `THESPLOG_THRESHOLD=info`, but I am getting no file at /tmp/thespian.log. Thespian version is 3.9.0.
So I am stumped as to what's going on right now. No doubt it's something very silly I am doing somewhere...
Maybe I will see if I can replicate it inside a docker container using an ubuntu 16.04 image or something like that.
ok so having run:
```shell
docker run -w /home -v /path/to/test/script:/home ubuntu:16.04 bash -c \
    "apt-get update && apt-get install -y python-pip && pip install thespian && python possible_thesp_bug.py"
```
I now get the same output as you from inside the container. So the problem appears to be something on my local machine. The local machine is also Ubuntu 16.04.
Found the issue - the version of Thespian installed was 3.9.0, an old version. Not sure how this happened given the version in my requirements.txt is 3.9.8... Anyway, the initial problem is now solved by upgrading to 3.9.8 - I now have the issue of too many messages you noted above. Will there be another release with that fix?
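For diagnosing this kind of environment mismatch, the installed distribution version can be checked at runtime with the standard library. This is a generic sketch (using `importlib.metadata`, Python 3.8+), not something from the thread itself:

```python
# Check which version of a distribution is actually installed in the
# running interpreter's environment - useful when requirements.txt and the
# live environment disagree, as happened here with Thespian 3.9.0 vs 3.9.8.
from importlib import metadata

def installed_version(dist_name):
    """Return the installed distribution's version, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# e.g. installed_version("thespian") -> "3.9.8" in a correct environment
print(installed_version("definitely-not-a-real-package"))  # -> None
```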
Thanks
Hi @andatt,
I'm glad you found the root issue with the old version! The thespian/troupe.py is updated on master, and that should give the correct number of messages. If that version looks good to you I'll do a release this weekend.
-Kevin
Hi Kevin
Just tested with those new commits and it works well for me! There is another intermittent issue I am experiencing, but I have not yet established the likely cause. It may not be Thespian related. If it looks like it is, though, I will create a new issue.
Thanks yet again for all your help!
Andrew
Released in version 3.9.9. https://github.com/kquick/Thespian/releases/tag/thespian-3.9.9
Hi Kevin,
I have another one for you:
I am expecting to see 110 'Received message' entries in the logfile. Instead the actor system hangs; the actors stop work but are still there. My logfile output (truncated) is:
/tmp/thespian.log is:
Any ideas what's going on?
Thanks