@bwehrle You make some good points about the Actor potentially having bugs or just a poor design. So on second thought, perhaps the exception is best, at least for the foreseeable future.
The idea I thought of presenting is what I just brought up in #104: using a sort of "shared mailbox" or a similar approach.
(1) I suggest adding worker actors behind a root router actor. That is, your current actor becomes just a router to other actors that actually do the work. Of the following routers, I think that SmallestMailboxRouter
would be best, because the root router would dispatch each incoming message to the least busy worker; that is, least busy from the perspective of lowest message count. Note that this does not account for the time required by any given worker actor (see the sketch after the router list below).
io.vlingo.xoom.actors.BroadcastRouter<P>
io.vlingo.xoom.actors.ContentBasedRouter<P>
io.vlingo.xoom.actors.RandomRouter<P>
io.vlingo.xoom.actors.RoundRobinRouter<P>
io.vlingo.xoom.actors.SmallestMailboxRouter<P>
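To make (1) concrete, here is a minimal plain-Java sketch of the smallest-mailbox idea. It is not the xoom-actors router API (with xoom you would subclass SmallestMailboxRouter and let it manage the routees), so treat the names Worker, WorkItem, and SmallestMailboxDispatch as hypothetical placeholders.

```java
// Conceptual sketch only, not the xoom-actors API.
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class WorkItem {
  final String payload;
  WorkItem(final String payload) { this.payload = payload; }
}

final class Worker {
  // Each worker's bounded queue stands in for an actor mailbox.
  final BlockingQueue<WorkItem> mailbox = new LinkedBlockingQueue<>(1024);
}

final class SmallestMailboxDispatch {
  private final List<Worker> workers;

  SmallestMailboxDispatch(final List<Worker> workers) {
    this.workers = workers;
  }

  // Route to the worker with the fewest queued messages. This says nothing
  // about how long each queued message will take to process.
  void route(final WorkItem item) {
    Worker leastBusy = workers.get(0);
    for (final Worker candidate : workers) {
      if (candidate.mailbox.size() < leastBusy.mailbox.size()) {
        leastBusy = candidate;
      }
    }
    if (!leastBusy.mailbox.offer(item)) {
      // If even the least busy mailbox is full, overflow is still possible.
      throw new IllegalStateException("all mailboxes are full");
    }
  }
}
```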
(2) On the other hand, you can introduce backpressure using a work-stealing approach, where worker actors request up to N new messages to work on when their current workload reaches zero. This is basically like Reactive Streams, but lighter weight (and possibly faster). You could test with XOOM Streams, where all actors have an "arrayQueue" mailbox. The drawback of work stealing is that it requires extra messages from the worker to the root router (a rough sketch follows the links below).
https://docs.vlingo.io/xoom-streams
https://github.com/vlingo/xoom-streams
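Here is a rough, self-contained sketch of that demand-request protocol in plain Java. It is not XOOM Streams itself, the names RootRouter and PullWorker are hypothetical, and it ignores threading, since with actors the router's pending queue would only ever be touched through the router's own mailbox.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class RootRouter {
  private final Deque<String> pending = new ArrayDeque<>();

  void submit(final String message) {
    pending.addLast(message);
  }

  // A worker asks for up to n messages; the router hands over only what exists.
  // This request is the "extra message from worker to root router" cost noted above.
  List<String> request(final int n) {
    final List<String> batch = new ArrayList<>();
    while (batch.size() < n && !pending.isEmpty()) {
      batch.add(pending.pollFirst());
    }
    return batch;
  }
}

final class PullWorker {
  private final RootRouter router;
  private final int batchSize;

  PullWorker(final RootRouter router, final int batchSize) {
    this.router = router;
    this.batchSize = batchSize;
  }

  // When the local workload reaches zero, request up to batchSize new messages,
  // mirroring Reactive Streams' request(n) without the full protocol.
  void runWhenIdle() {
    for (final String message : router.request(batchSize)) {
      process(message);
    }
  }

  private void process(final String message) {
    // do the actual work here
  }
}
```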
Either of the above approaches enables you to add or remove workers as the workload increases or decreases.
I think the only approach that seems workable involves an actor writing to its own outbox and being runnable only when that outbox has space. A worker thread sources from these outboxes and writes the messages to destination inboxes, at which point the source actor can run again.
This is guaranteed to work only when the message flow forms an acyclic graph (a DAG), which is exactly what you are describing in XOOM Streams. Each problem has a different solution. A rough sketch of the outbox idea follows.
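The sketch below illustrates that outbox idea under simplifying assumptions; the Stage and OutboxDrainer types are hypothetical and not part of any existing xoom API, and a single drainer thread is assumed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class Stage {
  final BlockingQueue<String> inbox = new ArrayBlockingQueue<>(256);
  final BlockingQueue<String> outbox = new ArrayBlockingQueue<>(256);

  // The stage is runnable only if it has input and room to emit output.
  boolean canRun() {
    return !inbox.isEmpty() && outbox.remainingCapacity() > 0;
  }

  void runOnce() {
    final String message = inbox.poll();
    if (message != null) {
      // Fits when canRun() was checked first; only the drainer removes from the outbox.
      outbox.offer("processed:" + message);
    }
  }
}

final class OutboxDrainer implements Runnable {
  private final Stage upstream;
  private final Stage downstream;

  OutboxDrainer(final Stage upstream, final Stage downstream) {
    this.upstream = upstream;
    this.downstream = downstream;
  }

  // Move messages only when the destination inbox has space; otherwise the
  // upstream outbox stays full and the upstream stage simply is not scheduled.
  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      final String message = upstream.outbox.peek();
      if (message != null && downstream.inbox.offer(message)) {
        upstream.outbox.poll();
      } else {
        Thread.onSpinWait(); // no progress possible right now
      }
    }
  }
}
```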
@bwehrle Please let me know if XOOM Streams is the way you will go. It could be that a new Processor filter could be used for routing; we can provide one for round-robin or least-busy routing. The backpressure protocol helps with that, but more could be done at the upstream source.
So far I can't tell whether that's needed. The only task I would say comes out of this issue is a note in the Mailbox documentation on what happens when a mailbox fills up, stating that if the developer needs a guarantee that this will not happen, they need to use XOOM Streams, or else use another mailbox and accept the risk of an OOM.
The ManyToOne mailbox can easily overflow when the Actor does not process messages in time, or when the Actor contributes to the overflow by sending messages to itself.
Prior to PR #104, an overflow would cause a deadlock and the CPU would spin indefinitely. After the PR, an overflow leads to an unchecked exception, which then impacts the calling thread or actor. This issue is to identify whether there are better solutions, based on what can actually be done.
First, any actor that is running out of space is in a condition where the rate of incoming events exceeds the rate of processed events over a sufficiently long period that the difference fills the queue.
Second, actors can cause this problem themselves due to incorrect design or bugs. We should consider this a degenerate case that should cause the actor to stop running and be suspended by its supervisor.
Third, applying backpressure in a system whose message flow is not an acyclic DAG (that is, it contains cycles) can lead to deadlocks.
Given the above, solutions are:
Raising an exception and failing the write -> this will lead to the supervisor eventually suspending the actor. The system continues, but the actor is no longer responsive. Clients will also receive exceptions when trying to send messages to it and should, where possible, adapt, or else fail as well.
Create a backstop queue that gracefully handles the overflow of the fast queue -> this is only a fix for a temporary surge in requests. If the backstop queue also fills up, the result will be an OOM and the end of the process (a hedged sketch follows the list).
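As an illustration of the backstop option, here is a hedged sketch (the BackstopMailbox type is hypothetical, not an existing xoom mailbox): a bounded fast queue absorbs normal traffic, an unbounded backstop absorbs temporary surges, and a sustained surge still ends in an out-of-memory failure.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class BackstopMailbox<T> {
  private final BlockingQueue<T> fast = new ArrayBlockingQueue<>(1024);
  private final Queue<T> backstop = new ConcurrentLinkedQueue<>(); // unbounded: OOM risk

  void send(final T message) {
    if (!fast.offer(message)) {
      backstop.add(message); // overflow: only safe for short-lived surges
    }
  }

  T receive() {
    final T next = fast.poll();
    if (next != null) {
      return next;
    }
    return backstop.poll(); // drain the surge once the fast queue is empty
  }
}
```

Note that this simple version does not preserve FIFO ordering across the two queues once an overflow has occurred; a real mailbox would need to drain the backstop before delivering newer fast-queue entries.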
@VaughnVernon any other ideas?