vlingo / xoom-actors

The VLINGO XOOM platform SDK for the type-safe Actor Model, delivering Reactive concurrency, high scalability, high-throughput, and resiliency using Java and other JVM languages.
https://vlingo.io
Mozilla Public License 2.0

ManyToOne Mailbox type overflow needs an agreed-upon solution #105

Status: Closed (bwehrle closed this issue 2 years ago)

bwehrle commented 2 years ago

The ManyToOne mailbox can easily overflow when the Actor does not process messages quickly enough, or when the Actor contributes to the overflow by sending messages to itself.

Prior to PR #104, an overflow caused a deadlock in which the CPU spun indefinitely. Since the PR, an overflow raises an unchecked exception, which then affects the calling thread or actor. This issue is meant to determine whether better solutions exist, given what can actually be done.
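The difference between spinning and failing fast can be pictured with a minimal sketch. It assumes a plain `java.util.concurrent.ArrayBlockingQueue` as the bounded queue; `BoundedMailbox` is a hypothetical name and this is not the actual xoom-actors ManyToOne mailbox code. When the sender is the owning actor itself, spinning on a full queue can never end, because that same thread is the only consumer; a failed non-blocking `offer` turned into an unchecked exception at least returns control to the caller.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded mailbox; NOT the xoom-actors ManyToOne implementation.
final class BoundedMailbox<T> {
  private final int capacity;
  private final BlockingQueue<T> queue;

  BoundedMailbox(int capacity) {
    this.capacity = capacity;
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  // Non-blocking enqueue. Spinning until space frees up would never end if the
  // sender is the owning actor itself, because that thread is also the consumer.
  // Failing fast pushes the problem back to the caller as an unchecked exception.
  void send(T message) {
    if (!queue.offer(message)) {
      throw new IllegalStateException("mailbox overflow: capacity " + capacity + " reached");
    }
  }

  // Dequeue on the owning actor's thread; returns null when empty.
  T receive() {
    return queue.poll();
  }

  int pending() {
    return queue.size();
  }
}

final class BoundedMailboxDemo {
  public static void main(String[] args) {
    BoundedMailbox<String> mailbox = new BoundedMailbox<>(2);
    mailbox.send("a");
    mailbox.send("b");
    try {
      mailbox.send("c"); // exceeds capacity 2
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // mailbox overflow: capacity 2 reached
    }
  }
}
```

The exception does not remove the imbalance; it only makes it visible to whoever sent the message, which is why the rest of this thread looks for alternatives.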

First, any actor that is running out of mailbox space is receiving messages faster than it processes them, and has been doing so for long enough that the accumulated difference fills the queue.
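The first point is simple arithmetic: with capacity C, an arrival rate of λ messages per second and a processing rate of μ messages per second, a queue that starts empty fills after C / (λ - μ) seconds whenever λ > μ. The helper below is hypothetical (not part of xoom-actors), assumes constant rates, and uses illustrative numbers rather than any real mailbox default.

```java
// Hypothetical helper (not part of xoom-actors): estimates how long a bounded
// mailbox takes to overflow under a constant arrival and processing rate.
final class OverflowEstimate {
  // Rates are in messages per second; the queue is assumed to start empty.
  // Returns the seconds until the queue is full, or +Infinity if it never fills.
  static double secondsUntilFull(int capacity, double arrivalRate, double processingRate) {
    double netGrowth = arrivalRate - processingRate; // messages accumulated per second
    if (netGrowth <= 0) {
      return Double.POSITIVE_INFINITY; // the queue is stable or draining
    }
    return capacity / netGrowth;
  }

  public static void main(String[] args) {
    // Illustrative numbers only: 65,536 slots, 12,000 msg/s in, 10,000 msg/s out.
    System.out.println(secondsUntilFull(65_536, 12_000, 10_000)); // 32.768
  }
}
```

Raising the capacity only postpones the overflow; as long as the arrival rate exceeds the processing rate, the queue eventually fills.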

Second, actors can cause this problem themselves through incorrect design or bugs. We should treat this as a degenerate case in which the actor stops running and is suspended by its supervisor.

Third, applying backpressure in a system whose message flow contains cycles (that is, one that is not a DAG) can lead to deadlocks.

Given the above, solutions are:

@VaughnVernon any other ideas?

VaughnVernon commented 2 years ago

@bwehrle You make some good points about the Actor potentially having bugs or just a poor design. So on second thought, perhaps the exception is best, at least for the foreseeable future.

The idea I had in mind is the one I just brought up in #104: using a sort of "shared mailbox," or a similar approach.

(1) I suggest adding worker actors behind a root router actor. That is, your current actor becomes just a router to other actors that do the actual work. Of the routers below, I think SmallestMailboxRouter would be best, because the root router would dispatch each incoming message to the least busy worker, measured by the lowest message count. This does not account for the time any given worker needs to process its messages.

io.vlingo.xoom.actors.BroadcastRouter<P>
io.vlingo.xoom.actors.ContentBasedRouter<P>
io.vlingo.xoom.actors.RandomRouter<P>
io.vlingo.xoom.actors.RoundRobinRouter<P>
io.vlingo.xoom.actors.SmallestMailboxRouter<P>
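To make the selection rule behind SmallestMailboxRouter concrete, here is a plain-Java sketch that does not use the xoom-actors API at all. `RoutedWorker` and `SmallestMailboxDispatcher` are hypothetical names, and each worker's mailbox is modeled as a `LinkedBlockingQueue`. The dispatcher enqueues each message on the worker with the fewest pending messages, breaking ties in favor of the earliest-added worker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical worker: only its pending-message queue matters for routing.
final class RoutedWorker {
  final String name;
  final LinkedBlockingQueue<String> mailbox = new LinkedBlockingQueue<>();

  RoutedWorker(String name) {
    this.name = name;
  }
}

// Hypothetical "smallest mailbox" dispatcher; NOT the xoom-actors SmallestMailboxRouter API.
final class SmallestMailboxDispatcher {
  private final List<RoutedWorker> workers = new ArrayList<>();

  void addWorker(RoutedWorker worker) {
    workers.add(worker);
  }

  void removeWorker(RoutedWorker worker) {
    workers.remove(worker);
  }

  // Enqueues the message on the worker with the fewest pending messages.
  // Queue length is only a proxy for load: it ignores how long each message takes.
  RoutedWorker route(String message) {
    if (workers.isEmpty()) {
      throw new IllegalStateException("no workers registered");
    }
    RoutedWorker target = workers.get(0);
    for (RoutedWorker w : workers) {
      if (w.mailbox.size() < target.mailbox.size()) {
        target = w;
      }
    }
    target.mailbox.offer(message);
    return target;
  }
}
```

The sketch assumes a single routing thread; with concurrent consumers, each `size()` is only a snapshot, which matches the caveat that message count is an approximation of how busy a worker really is.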

(2) On the other hand, you can introduce backpressure using a work-stealing approach. This is where worker actors request up to N new messages to work on when their current workload is zero. This is basically like Reactive Streams, but lighter weight (and possibly faster). You could test with XOOM Streams where all actors have an "arrayQueue" mailbox. The problem with work stealing is that it requires extra messages from the worker to the root router.

https://docs.vlingo.io/xoom-streams https://github.com/vlingo/xoom-streams
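The approach in (2) is pull-based: an idle worker asks for up to N messages, and the router never hands out more than was requested. The sketch below models that exchange with direct method calls on a single thread; `PullRouter`, `PullWorker`, and the demand value are hypothetical, and it does not use the XOOM Streams API. In an actor system each `request` call would be an extra message from the worker to the router, which is the overhead mentioned above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical router holding work until a worker asks for it.
final class PullRouter {
  private final Queue<String> backlog = new ArrayDeque<>();

  void submit(String message) {
    backlog.add(message);
  }

  // Called by a worker whose current workload is zero, with its demand n.
  // Returns at most n messages; fewer (or none) if the backlog is short.
  List<String> request(int n) {
    List<String> batch = new ArrayList<>();
    while (batch.size() < n && !backlog.isEmpty()) {
      batch.add(backlog.poll());
    }
    return batch;
  }

  int backlogSize() {
    return backlog.size();
  }
}

// Hypothetical worker that only asks for work after finishing its previous batch.
final class PullWorker {
  private final PullRouter router;
  private final int demand;
  int processed = 0;

  PullWorker(PullRouter router, int demand) {
    this.router = router;
    this.demand = demand;
  }

  // One round: request up to `demand` messages, then process the whole batch.
  // Returns false when the router had nothing to hand out.
  boolean runOnce() {
    List<String> batch = router.request(demand);
    processed += batch.size();
    return !batch.isEmpty();
  }
}
```

A worker's own queue can never hold more than its demand, but the router's backlog in this sketch is unbounded, so the pressure is only moved upstream rather than removed.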

Either of the above approaches lets you add or remove workers as the workload increases or decreases.

bwehrle commented 2 years ago

I think the only approach that seems to work is for each actor to write to its own outbox and to be allowed to run only when that outbox has space. A worker thread drains these outboxes into the destination inboxes, after which the source actor can run again.
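The outbox idea can be sketched as follows, assuming bounded `ArrayBlockingQueue`s for every inbox and outbox and a single thread that alternates between running actors and forwarding. `OutboxActor` and `Forwarder` are hypothetical names; in the design described, the forwarding would run on its own worker thread. No message is dropped and nothing throws: a full destination inbox leaves messages in the source outbox, and a full outbox stops the source actor from being scheduled.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical actor with a bounded inbox and its own bounded outbox.
final class OutboxActor {
  final BlockingQueue<String> inbox;
  final BlockingQueue<String> outbox;

  OutboxActor(int inboxCapacity, int outboxCapacity) {
    inbox = new ArrayBlockingQueue<>(inboxCapacity);
    outbox = new ArrayBlockingQueue<>(outboxCapacity);
  }

  // The scheduler only runs the actor if it has input and room to emit a result.
  boolean canRun() {
    return !inbox.isEmpty() && outbox.remainingCapacity() > 0;
  }

  // Processes one message and emits one result into its own outbox.
  // Single-threaded sketch: canRun() must be checked first, so offer() succeeds.
  void runOnce() {
    String message = inbox.poll();
    outbox.offer(message.toUpperCase());
  }
}

// Hypothetical forwarder: moves messages only while the destination has room.
final class Forwarder {
  static int forward(BlockingQueue<String> outbox, BlockingQueue<String> destinationInbox) {
    int moved = 0;
    while (!outbox.isEmpty() && destinationInbox.remainingCapacity() > 0) {
      destinationInbox.offer(outbox.poll());
      moved++;
    }
    return moved;
  }
}
```

If the destination also sent messages back to the source, each could end up waiting on the other's full queue, and neither would ever be scheduled again.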

This is only guaranteed to work if the actors form a directed acyclic graph (DAG), which is exactly what you are describing in XOOM Streams. Each problem has a different solution.
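Whether a given topology meets that precondition can be checked mechanically. The sketch below is a hypothetical `MessageFlowGraph` helper (not part of xoom-actors or XOOM Streams) that records which actor sends to which and applies Kahn's topological-sort algorithm: the graph is acyclic exactly when every node can be removed in dependency order. An actor that sends messages to itself appears as a self-loop and is reported as a cycle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: sender -> receivers edges of an actor message flow.
final class MessageFlowGraph {
  private final Map<String, List<String>> edges = new HashMap<>();

  void addEdge(String from, String to) {
    edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    edges.computeIfAbsent(to, k -> new ArrayList<>()); // register receiver as a node
  }

  // Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
  boolean isAcyclic() {
    Map<String, Integer> inDegree = new HashMap<>();
    for (String node : edges.keySet()) {
      inDegree.put(node, 0);
    }
    for (List<String> targets : edges.values()) {
      for (String target : targets) {
        inDegree.merge(target, 1, Integer::sum);
      }
    }

    Deque<String> ready = new ArrayDeque<>();
    inDegree.forEach((node, degree) -> {
      if (degree == 0) ready.add(node);
    });

    int removed = 0;
    while (!ready.isEmpty()) {
      String node = ready.poll();
      removed++;
      for (String target : edges.get(node)) {
        if (inDegree.merge(target, -1, Integer::sum) == 0) ready.add(target);
      }
    }
    // Nodes never removed lie on a cycle or downstream of one.
    return removed == edges.size();
  }
}
```

A router feeding workers that never reply to it passes the check; a worker that sends results back to its router does not.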

VaughnVernon commented 2 years ago

@bwehrle Please let me know if XOOM Streams is the way you will go. A new Processor filter could perhaps be used for routing; we can provide one for round-robin or least-busy dispatch. The backpressure protocol helps with that, but more could be done at the upstream source.

bwehrle commented 2 years ago

So far, I can't tell whether that's needed. The only task I see coming out of this issue is a note in the Mailbox documentation on what happens when a mailbox fills up, stating that developers who need a guarantee that this will not happen must use XOOM Streams, or else use another mailbox type and accept the risk of an OOM.