slawlor / ractor

Rust actor framework
MIT License

Multiple concurrent async tasks on actors #133

Closed Dessix closed 11 months ago

Dessix commented 11 months ago

While the documentation around it is quite sparse, the current implementation notes that long-running asynchronous tasks "block" the actor's message handler, and that delays or similar should be translated to timers and message events.

For use as a sort of in-app microservice architecture, this makes it difficult to run a long-running task as an actor. For example, a graceful stop request could be handled by having the message handler trigger a cancellation token stored in the actor's state, but the message pump is not polled while the actor's state is claimed, so the stop message is not processed until the running handler returns.

Suggested solution

If actor state were held under async lock guards, messages could be queued immediately any time the state is not currently locked, allowing for actors to "yield" back to the loop any time they are not currently processing work.

For any task that must be performed serially, holding the guard across the await effectively pauses the message pump until the runtime can acquire the lock again, allowing actors to opt in to non-concurrency.

This may even be simpler to implement than the above, in that the message handler could always be called by the runtime, with the expectation that any handler that uses "state" simply awaits a lock on the state guard. Messages would then be processed as concurrently as possible, with the performance drawback that many message futures could end up queued, each awaiting a lock on that actor's state.
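
A minimal sketch of the idea, not ractor's API: it uses only the standard library, with OS threads and `std::sync::Mutex` standing in for async tasks and an async lock. The names `State`, `handle`, and `run_concurrent` are hypothetical. Each handler does its slow work without the lock and only takes the guard for the short state update.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical actor state, shared behind a lock instead of `&mut State`.
#[derive(Default)]
struct State {
    processed: u64,
}

/// Hypothetical handler: the slow part runs without the lock, and the
/// guard is held only for the brief state update at the end.
fn handle(state: &Arc<Mutex<State>>, payload: u64) -> u64 {
    let result = payload * 2; // stand-in for long-running work
    let mut guard = state.lock().unwrap();
    guard.processed += 1;
    result
}

/// Runs one handler per message concurrently and returns how many
/// messages the shared state recorded.
fn run_concurrent(messages: Vec<u64>) -> u64 {
    let state = Arc::new(Mutex::new(State::default()));
    let handles: Vec<_> = messages
        .into_iter()
        .map(|m| {
            let state = Arc::clone(&state);
            thread::spawn(move || handle(&state, m))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let processed = state.lock().unwrap().processed;
    processed
}
```

Calling `run_concurrent(vec![1, 2, 3, 4])` runs four handlers at once and returns the number of state updates. A handler that needs serial execution would take the guard before its slow work and hold it until it finishes.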

Potential alternatives

Additional context

The root of the issue is that concurrency is limited by mutable access to Actor::State. Bevy's scheduling approach of reader / writer isolation may also be of interest here, in that queries which read a resource may be concurrent, while queries which write to that resource must be run in isolation from each other.
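
To illustrate reader / writer isolation in general terms (this is neither Bevy's scheduler nor ractor code), here is a small sketch using `std::sync::RwLock`. The function name `reader_writer_demo` is hypothetical. It uses the non-blocking `try_read` and `try_write` so the outcome does not depend on thread timing.

```rust
use std::sync::RwLock;

/// Returns (second_reader_ok, writer_blocked_by_reader, writer_ok_after).
fn reader_writer_demo() -> (bool, bool, bool) {
    let state = RwLock::new(0u32);

    // One reader is active.
    let r1 = state.try_read().expect("no writer holds the lock");

    // A second reader can join while the first is still active.
    let second_reader_ok = state.try_read().is_ok();

    // A writer is refused while any reader is active.
    let writer_blocked = state.try_write().is_err();

    // Once the reader is gone, the writer gets exclusive access.
    drop(r1);
    let writer_ok_after = match state.try_write() {
        Ok(mut w) => {
            *w += 1;
            true
        }
        Err(_) => false,
    };

    (second_reader_ok, writer_blocked, writer_ok_after)
}
```

Applied to actors, handlers that only read `Actor::State` would map to the read side and handlers that mutate it to the write side.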

jsgf commented 11 months ago

So is cancellation the main reason the factories don't work for you?

slawlor commented 11 months ago

So a few things to note that might address what you've outlined here.

  1. We have a cancellation token, of sorts, which is to kill() the actor. That will interrupt work at the next async await point, cancelling the downstream tasks. It isn't a super clean pattern, however, for example if you have custom initialization logic that needs to be cleaned up.
  2. Similar to @jsgf's point, this is sounding like a use-case for a factory of sorts. The factory itself isn't ever blocked on async work as it's just a glorified job scheduler, and the workers can be blocked for short or long periods. If the nature of your work is long, asynchronous tasks and many of them concurrently, it's a perfectly valid use-case to have 1k's, 10k's, or 100k's of workers in the factory. The only concern with a high worker count is that, if the message rate is also very high, the factory may be unable to keep up with scheduling incoming messages and start building a backlog.
  3. A "mutex" actor which issues multiple read locks to shared state is also a valid use-case, and is something that's commonly built. However, that's a higher-level building block than the runtime itself. Doing anything with shared state in an actor makes ownership quite difficult to reason about, whereas Rust gives us a simple ownership model in which a mutable borrow guarantees that no one else has a reference, so we're free to mutate as we see fit.
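
As a rough illustration of the factory idea in point 2 (this is not ractor's factory API), the sketch below uses standard-library threads and channels. `Factory`, `dispatch`, `shutdown`, and `run_jobs` are hypothetical names, and squaring a number stands in for real work. The factory only routes jobs; the workers are the ones that may block.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical factory: it only routes jobs and never runs them itself.
struct Factory {
    workers: Vec<Sender<u64>>,
    handles: Vec<JoinHandle<u64>>,
    next: usize,
}

impl Factory {
    fn new(worker_count: usize) -> Self {
        assert!(worker_count > 0, "need at least one worker");
        let mut workers = Vec::new();
        let mut handles: Vec<JoinHandle<u64>> = Vec::new();
        for _ in 0..worker_count {
            let (tx, rx) = channel::<u64>();
            // Each worker drains its own queue and may block on a job
            // for as long as it likes; here a "job" is just squaring.
            handles.push(thread::spawn(move || {
                rx.iter().map(|job| job * job).sum::<u64>()
            }));
            workers.push(tx);
        }
        Factory { workers, handles, next: 0 }
    }

    /// Round-robin dispatch: a cheap send that does not wait for the job.
    fn dispatch(&mut self, job: u64) {
        self.workers[self.next].send(job).expect("worker alive");
        self.next = (self.next + 1) % self.workers.len();
    }

    /// Closes the queues, waits for the workers, and sums their results.
    fn shutdown(self) -> u64 {
        drop(self.workers);
        self.handles.into_iter().map(|h| h.join().unwrap()).sum()
    }
}

/// Dispatches every job and returns the combined result.
fn run_jobs(jobs: &[u64], worker_count: usize) -> u64 {
    let mut factory = Factory::new(worker_count);
    for &job in jobs {
        factory.dispatch(job);
    }
    factory.shutdown()
}
```

For instance, `run_jobs(&[1, 2, 3, 4], 2)` spreads four jobs over two workers. The unbounded channels also show where the backlog mentioned above would build up if jobs arrive faster than workers drain them.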

If you just want a long-running task actor that you can still control interactively, you have basically 3 options.

  1. Factories, as stated above, but they are overkill for just a single task.
  2. The supervisor model, where the supervisor doesn't execute any operation directly, but manages the child "worker" (so a factory of concurrency level = 1) and can directly reply to coordination messages (stop/start/restart/etc). It can interrupt the child via the kill() primitive, capture the death via the supervision callback, and then do whatever it wants to restart/exit/etc.
  3. Just spawn a long-running task and keep the join handle in the actor's state, freeing the message pump. This requires more manual management for things like panics and whatnot, but for simple use-cases is probably the fastest and safe enough route. The actor can respond to messages and simply abort the task when it wants.
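
A hedged sketch of option 3, written against the standard library only and not against ractor's API: a thread and a channel play the roles of the actor and its mailbox, and `Msg`, `State`, `long_job`, `actor_loop`, and `start_then_stop` are hypothetical names. OS threads cannot be aborted, so a cooperative cancel flag stands in for aborting a task through its join handle.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical control messages for the actor.
enum Msg {
    Start,
    /// Stops the job and replies with the number of steps it completed.
    Stop(Sender<u64>),
}

/// Actor state: owns the handle of the background job, not the job itself.
#[derive(Default)]
struct State {
    job: Option<(JoinHandle<u64>, Arc<AtomicBool>)>,
}

/// Long-running job: does one step, then checks the cancel flag.
fn long_job(cancel: Arc<AtomicBool>) -> u64 {
    let mut steps = 0;
    loop {
        thread::sleep(Duration::from_millis(1)); // stand-in for real work
        steps += 1;
        if cancel.load(Ordering::Relaxed) {
            return steps;
        }
    }
}

/// Signals cancellation and waits for the job; a panicked job reports 0.
fn stop_job(state: &mut State) -> u64 {
    match state.job.take() {
        Some((handle, cancel)) => {
            cancel.store(true, Ordering::Relaxed);
            handle.join().unwrap_or(0)
        }
        None => 0,
    }
}

/// Message pump: `Start` returns immediately, so `Stop` is seen promptly.
fn actor_loop(inbox: Receiver<Msg>) {
    let mut state = State::default();
    for msg in inbox {
        match msg {
            Msg::Start => {
                if state.job.is_none() {
                    let cancel = Arc::new(AtomicBool::new(false));
                    let flag = Arc::clone(&cancel);
                    let handle = thread::spawn(move || long_job(flag));
                    state.job = Some((handle, cancel));
                }
            }
            Msg::Stop(reply) => {
                let steps = stop_job(&mut state);
                let _ = reply.send(steps);
            }
        }
    }
    // Inbox closed: cancel any job that is still running.
    stop_job(&mut state);
}

/// Starts the job, lets it run briefly, then stops it via a message.
fn start_then_stop() -> u64 {
    let (tx, rx) = channel();
    let actor = thread::spawn(move || actor_loop(rx));
    tx.send(Msg::Start).unwrap();
    thread::sleep(Duration::from_millis(20));
    let (reply_tx, reply_rx) = channel();
    tx.send(Msg::Stop(reply_tx)).unwrap();
    let steps = reply_rx.recv().unwrap();
    drop(tx);
    actor.join().unwrap();
    steps
}
```

`start_then_stop()` returns how many steps the job completed before the stop request took effect. The `unwrap_or(0)` in `stop_job` is the kind of manual panic handling mentioned above; a supervisor would otherwise take care of it.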

slawlor commented 11 months ago

closing as answered, feel free to re-open if you have additional concerns