kquick / Thespian

Python Actor concurrency library
MIT License

Checking that actors are ready for next assignment #61

Closed: davideps closed this issue 2 years ago

davideps commented 4 years ago

I have two types of actors, a single SCHEDULER and many AGENTS. For some activities the scheduler cannot proceed unless all the agents have completed their work. So, the scheduler sends a check_status message to all agents and tracks what they send back. Since actors are single threaded and process messages in the order received (right?), when an agent sees the check_status message, it must be done with its previous work, so it replies with status "ok".

This works fine for up to 800 agents.

When I try 1,000 agents, I get back about 800 "ok" messages, and then nothing. The scheduler waits 30 seconds and then sends a check_status message to each missing agent (not all agents). In some cases it gets back one more "ok". To be clear, there are no "poison" message warnings.

I have three questions:

  1. Is this the best way to know that all agents are ready for more work?
  2. Why are messages going missing? Have I exceeded the inbox size?
  3. Why are messages missing even after the second try? I really expected that to work!

kquick commented 4 years ago

It's not always the case that messages are processed in the order sent. They usually are, but there can be exceptions.

For questions 2 and 3, it's possible that you are encountering the effects of the rate limiter and the send threshold sizes in combination with busy actors. Thespian queues messages on the sender side rather than the receiver side so that back-pressure propagates throughout the actor network. The general effect is that if any actor has too many outbound messages queued, Thespian internally places that actor in transmit-only mode until the queue drops below a specific threshold (on the theory that actors "respond" to incoming messages, so allowing more incoming messages would push the outbound queue higher). If the agent actors are busy (e.g. blocked doing work), they can't accept new messages, so those messages are queued and retried periodically on the sender's side. After too many retries they should be aborted as PoisonMessages, but in the interim the send queue threshold would prevent the scheduler from receiving responses even from idle agent actors until the busy ones finish their work and process incoming messages, which lets the scheduler's send queue drop below the watermark where receives are allowed.
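To make the stall concrete, here is a toy model in plain Python. It is not Thespian's implementation: the class `SenderModel`, its `threshold` value, and the agent names are all hypothetical, and the real rate limiter and retry logic are more involved. It only illustrates how queued sends to busy agents can block receives until enough of those agents become free.

```python
class SenderModel:
    """Toy model of one actor's outbound queue; not Thespian's implementation."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical limit on queued sends
        self.pending = []           # sends not yet accepted by their targets

    def send(self, target, busy_targets):
        """Deliver immediately to an idle target; queue sends to busy ones."""
        if target in busy_targets:
            self.pending.append(target)

    def retry(self, busy_targets):
        """Retry queued sends; those whose target is now idle are delivered."""
        self.pending = [t for t in self.pending if t in busy_targets]

    @property
    def receives_allowed(self):
        """Incoming messages are accepted only while the queue is below threshold."""
        return len(self.pending) < self.threshold


busy = {f"agent-{i}" for i in range(5)}     # agents blocked doing work
scheduler = SenderModel(threshold=3)
for i in range(10):
    scheduler.send(f"agent-{i}", busy)      # five of the ten sends queue up
blocked = not scheduler.receives_allowed    # replies from idle agents now wait

busy -= {"agent-0", "agent-1", "agent-2"}   # three agents finish their work
scheduler.retry(busy)                       # their queued messages get through
```

In this model the scheduler stops accepting replies as soon as three sends are stuck, even though five agents were idle and answered right away; receives resume only once the busy agents drain the queue below the threshold.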

To answer number 1 (and given the context I've described above), I would suggest having the agent actors tell the scheduler when they have completed a particular work item, and the scheduler should have an internal table of which agents are busy and which are idle. This is the approach taken by the troupe helper as well: when incoming work is received the "scheduler"/troupe-leader gets an idle "agent"/worker from the idle queue (creating a new one if the queue is empty and creating more is allowed), then it puts that actor address on the busy queue and sends the work to the "agent"/worker. When the "agent"/worker has completed the work, it sends back a "ready" message to the "scheduler"/troupe-leader which tells the latter to remove that "agent"/worker from the busy list and put it back on the idle list.
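The bookkeeping described above can be sketched in plain Python, independent of Thespian. The class `AgentTracker` and its method names are hypothetical, and strings stand in for actor addresses. In an actual scheduler actor, `assign` would be followed by a `self.send(agent, work)`, and `mark_ready` would be called from `receiveMessage` when an agent's "ready" message arrives.

```python
from collections import deque


class AgentTracker:
    """Tracks which agents are idle and which are busy (hypothetical helper)."""

    def __init__(self, agents):
        self.idle = deque(agents)   # agents available for work
        self.busy = set()           # agents with work outstanding

    def assign(self):
        """Take an idle agent, mark it busy, and return it (None if none idle)."""
        if not self.idle:
            return None
        agent = self.idle.popleft()
        self.busy.add(agent)
        return agent

    def mark_ready(self, agent):
        """Record an agent's "ready" message: move it from busy back to idle."""
        if agent in self.busy:
            self.busy.remove(agent)
            self.idle.append(agent)

    def all_idle(self):
        """True when no work is outstanding, so the next phase can start."""
        return not self.busy


tracker = AgentTracker(["agent-1", "agent-2", "agent-3"])
first = tracker.assign()      # scheduler hands work to an idle agent
second = tracker.assign()
tracker.mark_ready(first)     # that agent reports "ready" when done
```

With this table the scheduler never has to poll: it proceeds to the next activity once `all_idle()` returns true, and it sends nothing to agents that are still working.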

davideps commented 4 years ago

Thanks for these suggestions, Kevin. I'll keep working on it.

kquick commented 4 years ago

Sounds good, @davideps. Please feel free to post back on status/issues.

Also FYI, I prefer that the person who opens an issue be the one to close it, so that I know they have gotten the information/help they need and that the issue is no longer a problem for them. I'm happy to leave this issue open until you feel comfortable closing it.

kquick commented 3 years ago

Hi @davideps, did you get this resolved satisfactorily?

davideps commented 2 years ago

Hi Kevin,

Yes, thank you! At the moment I'm not using Thespian but I do appreciate the help you offered before.

-david
