rifoerster opened this issue 2 years ago

I have a problem reaching certain actors after a while. The ActorSystem sends the message, but the receive function is never called. There is no dead letter either (dead-letter handling doesn't trigger). For debugging purposes there is a wakeupAfter loop, in which the actor basically sends itself a message every minute, and that loop continues to work. Just after a certain, still unknown period of inactivity (apart from the wake-up loop), the actor simply ceases to receive any messages.

Is there any way to debug this further? What information can I provide to pin down the situation, and how do I get it? Thespian: 3.10.5, Python: 3.9, Pipenv: 2021.11.23
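For reference, the dead-letter handling is registered roughly like this (a simplified sketch; the 'start' trigger and the print are placeholders for the real setup):

```python
from thespian.actors import Actor

class DeadLetterWatcher(Actor):
    """Receives messages that the system could not deliver (sketch)."""
    def receiveMessage(self, msg, sender):
        if msg == 'start':
            # Ask the actor system to route undeliverable messages here.
            self.handleDeadLetters(startHandling=True)
        else:
            print('dead letter:', msg)
```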
Hi @rifoerster,
It would be great if you could reproduce this with a simple Actor or set of Actors that you could attach to this issue, so that I can reproduce the problem locally. You mention "after a while" but don't give an actual timeframe; I can say that I've used Actors in production that have run without problems for months at a time, so the behavior you are describing is not the expected behavior.
You can also look at the internal thesplog logging file that Thespian maintains (https://github.com/thespianpy/Thespian/blob/master/thespian/system/utilis.py#L77-L80) although you will probably want to change the maximum size of that file and the base logging severity (https://github.com/thespianpy/Thespian/blob/master/thespian/system/utilis.py#L25-L28).
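If I remember the settings correctly, those values can be adjusted through environment variables before the actor system starts; a rough sketch (the variable names are the ones I recall from utilis.py, so double-check them against the linked source):

```python
import os

# Assumed environment variable names, per thespian/system/utilis.py;
# set them before the ActorSystem (and its admin process) is created.
os.environ['THESPLOG_FILE'] = '/tmp/thespian.log'            # where thesplog writes
os.environ['THESPLOG_FILE_MAXSIZE'] = str(10 * 1024 * 1024)  # raise the size cap
os.environ['THESPLOG_THRESHOLD'] = 'DEBUG'                   # lower the severity floor

from thespian.actors import ActorSystem
asys = ActorSystem('multiprocTCPBase')
```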
Hi Kevin,
I'm a colleague of Riccardo and have spent a lot of time hunting this bug. In the meantime I have reorganized our complete ActorSystem and introduced a Registrar Actor that keeps track of all created Actors. The Registrar Actor can send regular KeepAliveMsg messages to all other Actors in the system to check whether they are alive. The relevant message is now sent by an Actor as well (not from the surrounding program as before). Nevertheless, this very special bug is still there.
The system is an IoT system for telemetry. For every connected measuring device we create a Device Actor that is responsible for forwarding binary messages to the device. It has to receive a TxBinaryDataMsg message and handle it.
The Registrar and the KeepAliveMsg that the Registrar sends every 10 minutes work like a charm. Nevertheless, after about 20 minutes of inactivity the Device Actor no longer starts the message handler for the TxBinaryDataMsg.
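To make the setup concrete, the keep-alive machinery looks roughly like this (a condensed sketch; KeepAliveMsg, the 'start' trigger, and the Registrar bookkeeping are simplified stand-ins for our real classes):

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage

class KeepAliveMsg:
    """Placeholder for our real keep-alive message class."""

class Registrar(Actor):
    """Keeps track of created Actors and pings them periodically (sketch)."""
    def __init__(self):
        super().__init__()
        self.known_actors = []  # addresses of the Actors being watched

    def receiveMessage(self, msg, sender):
        if msg == 'start':
            self.wakeupAfter(timedelta(minutes=10))
        elif isinstance(msg, WakeupMessage):
            # Ping every known Actor, then schedule the next round.
            for addr in self.known_actors:
                self.send(addr, KeepAliveMsg())
            self.wakeupAfter(timedelta(minutes=10))
```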
Here are some of my findings:
A KeepAliveMsg reaching the Device Actor shortly after the jam can cure it. Before handling the KeepAliveMsg, the Actor handles the older TxBinaryDataMsg.

For now I will stay with multiprocUDPBase, which seems to be a reliable workaround, but any hint that brings some light into this mystery will be very welcome.
Regards, Michael
The TCP protocol is a streaming protocol, so in order for Thespian to reconstruct a message from the byte stream and confirm its reception, there is some bidirectional communication over the TCP transport between the sending actor and the receiving actor. In contrast, the UDP protocol is message oriented with no confirmation of delivery (thus earning it the "unreliable" label, since there's no way to know whether a packet was received).
I cannot tell where you are running the different actors in your architecture, or what the actors are doing without some sample code that reproduces the problem, but you might want to verify that bidirectional traffic is fully functional and promptly handled in your network. It may also be that you have some sort of router or gateway device that is closing or otherwise dropping long-running inactive socket connections.
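For comparison purposes, note that the transport is selected when the actor system is created, so switching between the two bases for a test run is a one-line change:

```python
from thespian.actors import ActorSystem

# TCP transport: stream-oriented, with delivery-confirmation traffic.
asys = ActorSystem('multiprocTCPBase')
# ... exercise the Device Actors here ...
asys.shutdown()

# UDP transport: message-oriented, no confirmation of delivery.
asys = ActorSystem('multiprocUDPBase')
asys.shutdown()
```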
All actors are running on the same computer, a Raspberry Pi 4. Yesterday I ran some tests with two measuring devices connected, so I had two Device Actors running. Directly after the start, both Actors responded as expected; then I tried various periods of inactivity. With periods shorter than 20 minutes, both Device Actors were always responsive. With longer periods, the first Device Actor tried (A) usually responded, while the second one (B) didn't. After trying A once more, B responded as well. Afterwards, every further attempt on A or B was a success.
This issue has been bothering me for months. Several weeks ago I introduced threads in some of my Actors in order to improve their responsiveness, and that was when I first experienced the issue in a reproducible way. It seems to appear when we send an Actor message from inside a thread that is running within an Actor. The issue was reproducible on Linux as well as on Windows.
In the meantime I have been able to create a kind of minimal example based on your hellogoodbye.py. Please refer to https://github.com/mistrey/Thespian/blob/master/examples/threaded_hellogoodbye.py.
The example runs with multiprocTCPBase but not with multiprocUDPBase. With the sleep time increased to 0.5 s, I have seen the issue once with multiprocTCPBase as well. With the sleep(0.0001) removed, the example runs with multiprocTCPBase as well as with multiprocUDPBase.
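In essence, the pattern that triggers the jam is an actor that replies from a worker thread, roughly like this (a condensed sketch, not the actual example file; the names and the greeting are placeholders):

```python
import threading
from time import sleep
from thespian.actors import Actor

class Hello(Actor):
    def receiveMessage(self, msg, sender):
        # Reply from a worker thread instead of from the receive context;
        # this is the pattern under which the jam shows up.
        def answer():
            sleep(0.0001)  # the sleep mentioned above
            self.send(sender, 'Hello, World!')
        threading.Thread(target=answer).start()
```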