TheNeikos opened this issue 4 years ago
This is not implemented yet, but it's a great idea. When I have some time I'll take a look and see whether there are any built-in features in the wasmtime engine that support this, or whether we'll have to code it ourselves.
In the same vein, I think it would be great to expose those stats as well, so that one can find out how much each actor is currently consuming.
I couldn't agree more. There's a bit of a hack for this where you could use the Prometheus middleware and a Grafana dashboard to watch the usage of each actor... but what we really should have is a nice, holistic way of tracking, and potentially limiting, the usage of actors and providers.
Thinking "out loud":
One approach might be to keep track of "actor time": an accumulation of the total number of milliseconds spent in execution by the actor since start time. By maintaining this value, we could expose it for query, and also return an `Err` when we attempt to invoke an actor that has exceeded some defined (env variable?) quota for actor time.
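A minimal sketch of what that accumulator could look like (all names here are illustrative, not waSCC's actual API):

```rust
use std::time::{Duration, Instant};

/// Sketch of the "actor time" idea: a running total of time spent in
/// execution, checked against an optional quota before each invocation.
struct ActorTime {
    consumed: Duration,
    quota: Option<Duration>, // None = unlimited
}

impl ActorTime {
    fn new(quota: Option<Duration>) -> Self {
        Self { consumed: Duration::ZERO, quota }
    }

    /// Err if the actor has already exhausted its quota.
    fn check(&self) -> Result<(), String> {
        match self.quota {
            Some(q) if self.consumed >= q => Err("actor time quota exceeded".into()),
            _ => Ok(()),
        }
    }

    /// Wrap an invocation, adding its elapsed time to the running total.
    fn invoke<T>(&mut self, f: impl FnOnce() -> T) -> Result<T, String> {
        self.check()?;
        let started = Instant::now();
        let out = f();
        self.consumed += started.elapsed();
        Ok(out)
    }
}

fn main() {
    let mut time = ActorTime::new(Some(Duration::from_millis(50)));
    assert_eq!(time.invoke(|| 2 + 2), Ok(4));
    // A zero quota is exhausted immediately.
    assert!(ActorTime::new(Some(Duration::ZERO)).invoke(|| 0).is_err());
}
```

Note this counts wall-clock milliseconds, matching the "no physical CPU accounting" constraint mentioned below.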
For completeness, if we maintained "actor time", we would also need a way, on `WasccHost`, to query the actor time for an individual actor, or maybe all actors in the host.
When declaring a quota for actor time, we could define a sliding window, where the actor can spend no more than *n* milliseconds in execution within a time period of *t* minutes. If *t* is 0, that would essentially be an absolute limit rather than a sliding-window limit and, once exceeded, the actor would be unusable.
Ancillary question: if an actor exceeds its actor-time budget under an absolute quota, should we just remove the actor from the host rather than continuing to return an `Err` for each invocation?
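The sliding-window idea could be sketched roughly like this (hypothetical names; a window of zero degenerates into the absolute limit described above):

```rust
use std::time::{Duration, Instant};

/// Hypothetical sliding-window quota: the actor may spend no more than
/// `budget` of execution time within any trailing `window`.
struct SlidingQuota {
    budget: Duration,
    window: Duration, // Duration::ZERO => absolute lifetime limit
    samples: Vec<(Instant, Duration)>,
    lifetime: Duration,
}

impl SlidingQuota {
    fn new(budget: Duration, window: Duration) -> Self {
        Self { budget, window, samples: Vec::new(), lifetime: Duration::ZERO }
    }

    /// Record time spent by one invocation.
    fn record(&mut self, spent: Duration) {
        self.lifetime += spent;
        self.samples.push((Instant::now(), spent));
    }

    /// Has the actor spent more than `budget` within the trailing `window`?
    fn over_budget(&mut self) -> bool {
        if self.window.is_zero() {
            // t == 0: absolute quota; once exceeded, the actor stays unusable.
            return self.lifetime > self.budget;
        }
        // Drop samples that have fallen out of the trailing window.
        if let Some(cutoff) = Instant::now().checked_sub(self.window) {
            self.samples.retain(|(at, _)| *at >= cutoff);
        }
        self.samples.iter().map(|(_, spent)| *spent).sum::<Duration>() > self.budget
    }
}

fn main() {
    // Absolute quota (window = 0) with a 10 ms budget.
    let mut quota = SlidingQuota::new(Duration::from_millis(10), Duration::ZERO);
    quota.record(Duration::from_millis(6));
    assert!(!quota.over_budget());
    quota.record(Duration::from_millis(6));
    assert!(quota.over_budget());
}
```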
NB: I'd like to avoid having to keep track of things like physical CPU usage, because that makes the host all the more platform/arch/OS-dependent (not to mention unreliable, depending on where the host is running), whereas counting milliseconds spent in execution can easily be done without pulling in dependency trees.
I don't think that removing the actor by default is a good idea. For example, if you are hosting an actor for which someone buys X amount of time, they can easily 'buy more', so to speak, by adding to the available time, so that the next time the actor is invoked it works again.
And since the error itself would be descriptive (I imagine something like `Error::NoRunningTimeLeft` or so), removing the actor from the host would always remain a possibility.
Excellent point. I hadn't considered the use case where an actor's execution quota could be modified live while actively deployed. I was thinking entirely from the microservices "deploy once and leave it until the next update" perspective.
As this feature gets implemented, then, I think there are some requirements:
- Quotas need to be settable and modifiable on a live `WasccHost` (the "modified while deployed" use case above).
- The same needs to work across a `lattice`.
- Actor time needs to be queryable through the `WasccHost` API and API calls on the lattice.
- Exceeding a quota needs to produce an `ExecutionQuotaExceeded` error, which will then allow the consumer to decide whether or not they want to remove the actor or, as you say, someone could add more quarters to the machine and keep the actor alive.

I will edit the subject of this issue to convert it from a question into an imperative to implement this functionality.
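A rough sketch of how a consumer might react to the proposed error (the error type and handler here are hypothetical, not waSCC's actual API):

```rust
/// Hypothetical error type; the variant name matches the discussion above.
#[derive(Debug)]
enum HostError {
    ExecutionQuotaExceeded { actor: String },
}

/// The host only reports the error; the consumer decides whether to
/// remove the actor or leave it in place for a quota top-up
/// ("add more quarters").
fn on_invoke_result(result: Result<Vec<u8>, HostError>, remove_on_exceed: bool) -> &'static str {
    match result {
        Ok(_) => "ok",
        Err(HostError::ExecutionQuotaExceeded { .. }) if remove_on_exceed => "remove actor from host",
        Err(HostError::ExecutionQuotaExceeded { .. }) => "keep actor, wait for quota top-up",
    }
}

fn main() {
    let err = || Err(HostError::ExecutionQuotaExceeded { actor: "example".into() });
    assert_eq!(on_invoke_result(err(), true), "remove actor from host");
    assert_eq!(on_invoke_result(err(), false), "keep actor, wait for quota top-up");
}
```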
Sounds great! One possible issue I can think of:
- How should actor time be counted for `cast` patterns, as enabled by #72?

I think a good mantra to ask as this gets implemented might be, "Would a FaaS bill for this time?" If a function wakes up and then makes a synchronous call to some other billable resource, then the function is accruing billable usage at the same time as the resource being consumed.
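One way to honor that mantra is a "billing clock" that pauses while the actor is blocked on something outside its control. A hypothetical sketch:

```rust
use std::time::{Duration, Instant};

/// Hypothetical billing clock: accrues time only while the actor's own
/// code runs, and is paused while the actor waits on an external
/// resource (e.g. a network call dispatched via `cast`).
struct BillingClock {
    billed: Duration,
    running_since: Option<Instant>,
}

impl BillingClock {
    fn new() -> Self {
        Self { billed: Duration::ZERO, running_since: None }
    }
    /// Actor code starts (or resumes) executing: start billing.
    fn resume(&mut self) {
        self.running_since = Some(Instant::now());
    }
    /// Actor blocks on an external resource: stop billing.
    fn pause(&mut self) {
        if let Some(started) = self.running_since.take() {
            self.billed += started.elapsed();
        }
    }
    fn billed(&self) -> Duration {
        self.billed
    }
}

fn main() {
    let mut clock = BillingClock::new();
    clock.resume(); // actor executing: billed
    clock.pause();  // actor awaiting a provider: not billed
    clock.pause();  // a second pause is a no-op
    assert!(clock.billed() < Duration::from_secs(1));
}
```

Under this scheme, time spent inside a synchronous host call is still billed unless the host explicitly pauses the clock around it, which matches the FaaS analogy above.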
Ah, the `cast` part would solve my question perfectly, as in my case a network connection is being made and I don't want to 'punish' any actors for things outside of their direct control.
I'm willing to help realize these requests, btw. I'll send you an email soon so we can coordinate, if you wish.
Cheers!
Is setting a hard cap on CPU time/RAM a possibility right now? I've looked through the API and didn't find anything pertaining to it. Or is this just not yet implemented?