vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation
MIT License

Guarantee execution time per the invocation request like AWS Lambda #961

Closed: huasiy closed this issue 2 weeks ago

huasiy commented 4 months ago

Describe the enhancement

In Lambda, the maximum execution time for each request is 15 minutes, and the instance executing a request is not terminated prematurely. However, based on experiments and existing issues in Knative, requests in Knative may be terminated prematurely. To prevent pods from being deleted prematurely, I currently have to increase the stable window, but this lowers resource utilization. Does anyone have a better solution?
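For concreteness, a minimal sketch of the stable-window workaround described above, via the `autoscaling.knative.dev/window` annotation; the service name, image, and window value are placeholders:

```yaml
# Sketch: lengthen the autoscaler's stable window so pods are retained
# longer between requests (at the cost of lower resource utilization).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-function                              # placeholder
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "300s"   # default is "60s"
    spec:
      containers:
        - image: example.com/my-function:latest  # placeholder
```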

leokondrashov commented 3 months ago

Hi, this is not entirely true. We can increase the grace period so that the ongoing invocation has a chance to finish execution. With correct termination handling, the instance exists only for the duration of the execution and is destroyed as soon as the request is done, so terminating pods won't consume extra resources.
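As a concrete sketch of this: in Knative, the revision's `timeoutSeconds` caps the request duration, and the pod's termination grace period is derived from it (as noted later in this thread), so raising the timeout gives in-flight invocations room to finish after a scale-down. Names and values here are placeholders:

```yaml
# Sketch: raise the revision timeout; Knative derives the pod's
# termination grace period from it, so an in-flight invocation can
# finish after the pod receives SIGTERM.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-function                              # placeholder
spec:
  template:
    spec:
      timeoutSeconds: 600  # reaching Lambda's 15 min may also require raising
                           # max-revision-timeout-seconds in config-defaults
      containers:
        - image: example.com/my-function:latest  # placeholder
```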

huasiy commented 3 months ago

Thank you for your response. The method you mentioned indeed ensures that requests are not terminated prematurely. Now I am wondering whether Knative can provide a Lambda-like warm-up mechanism, where instances are kept warm as long as requests arrive at a certain frequency.

leokondrashov commented 2 months ago

By default, instances are controlled by the autoscaler, which decides when to scale up and down based on the observed concurrent request count over a window. So it does hold instances for some time, although the logic is a bit more complicated than the commonly used keep-alive policies.

If we are talking about keeping a single instance warm, you can trigger an execution once per window period (60s by default). Keeping two instances warm is much trickier; I won't even try to reason about how to make that happen. Also, increasing the autoscaling window keeps warm instances around for longer.
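A minimal sketch of the single-instance trick: a CronJob that fires a dummy request once per (default) window, assuming the function is reachable at the usual in-cluster URL. All names here are placeholders:

```yaml
# Sketch: ping the function once a minute so the autoscaler always sees
# at least one request per 60s stable window.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keep-warm-pinger
spec:
  schedule: "* * * * *"                # every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ping
              image: curlimages/curl
              args: ["-fsS", "http://my-function.default.svc.cluster.local/"]
```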

huasiy commented 2 months ago

Unfortunately, I need to maintain a pool with an arbitrary number of instances (greater than one). Does this mean I can't use Knative to achieve this? Is there any other software that provides a "warm" instance mechanism similar to AWS Lambda's?

leokondrashov commented 2 months ago

Sorry, I didn't state the obvious solution. My previous answer focused on a trick to keep an instance warm, but there is a proper way to retain a specific scale: you can set the minimum scale (docs) of the function. That works like provisioned concurrency in AWS Lambda: you will always have at least that many instances of the function, with the ability to scale up automatically.
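A minimal sketch of this approach, using the `autoscaling.knative.dev/min-scale` annotation (the name, image, and count are placeholders):

```yaml
# Sketch: pin a floor of warm instances, analogous to provisioned
# concurrency in Lambda; the autoscaler can still scale above it.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-function                               # placeholder
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "3"    # never fewer than 3 pods
    spec:
      containers:
        - image: example.com/my-function:latest   # placeholder
```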

huasiy commented 2 months ago

Apologies, my previous question was not clear. I don't want to pre-allocate instances. As discussed previously, I am wondering whether Knative can provide a Lambda-like warm-up mechanism, where instances are kept warm as long as requests arrive at a certain frequency. As you've pointed out, the Knative autoscaler scales up and down based on the observed concurrent request count over a window. However, when Knative scales down, it doesn't know which instances are processing requests, so it may kill instances that are mid-request. I want to know whether Knative can track the state of each instance and kill only idle ones. If Knative can't, what other software can?

leokondrashov commented 2 months ago

First, yes, it is not possible right now in knative to terminate a specific pod, so we may terminate a pod with a running request. This has been a feature request in knative and k8s for 5 years (you added the link in the original question).

Second, I don't think this is a problem. Pod termination is not instant. At the transition to the terminating state, the pod receives SIGTERM and is given a grace period (in the case of knative, set to the function timeout) to finish its current request(s) and terminate gracefully. Only after the grace period is it forcefully terminated with SIGKILL. So the autoscaler works the same way as before; resource utilization is somewhat higher (we need to retain instances in the terminating state until they finish their requests), but I don't think that's a big issue.
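To make the lifecycle concrete, an illustrative fragment of the pod spec that Knative generates for a revision (this deployment is managed by Knative, not written by hand, and the values are illustrative):

```yaml
# On scale-down the pod receives SIGTERM, then has up to
# terminationGracePeriodSeconds (derived from the revision timeout)
# to drain in-flight requests before SIGKILL.
spec:
  terminationGracePeriodSeconds: 600              # illustrative value
  containers:
    - name: user-container
      image: example.com/my-function:latest       # placeholder
```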

Third, I'm still confused about the warm-up mechanism that you are talking about. If we are discussing creating additional requests to keep the instance warm, I'd say that is not a mechanism but a trick that exploits AWS' keepalive policy. The proper way to ensure the presence of at least a specific number of instances is to request it directly from the provider. I don't see the connection to the original question: the instance with long-running requests won't receive additional requests until it handles the existing one (if the instance can get only one request at a time). So, firing additional requests to keep it warm won't affect the instance's lifetime since they would be routed to another instance.

If something is still unclear, please provide an example of what you need to do/want to see, so we can discuss it.

huasiy commented 1 month ago

Sorry for the late response. What I actually want is an independent idle timer for each instance: an instance should be released only if no requests are routed to it for a predefined period (like Knative's default 90 seconds), and each incoming request would reset its timer.

leokondrashov commented 1 month ago

So, what you want is not supported in the stock knative version since it contradicts the division of responsibilities built into the knative infrastructure. In knative, Activator knows the available instances and load balances invocations between them, without any control over the instances. In contrast, Autoscaler knows and controls the number of instances and has the information about the invocations as an aggregate (RPS or concurrency, depending on the metric). To implement what you want, you need to let Activator control the lifetime of the instances or make Autoscaler aware of each invocation and every pod.

huasiy commented 1 month ago

Any ideas for controlling the lifecycle of instances? Would it be a sound approach to label each pod to indicate whether it is currently processing a request?

leokondrashov commented 1 month ago

No ideas that I can think of right now. I don't think labeling would be viable due to the sheer number of requests you would need to send to the k8s API server (each incoming request would trigger two pod-label updates: one when it starts and one when it finishes).

huasiy commented 1 month ago

It seems using the grace period is a feasible approach. If I find any useful clues, I will update this issue. Thanks for your patience.