projectkudu / kudu

Kudu is the engine behind git/hg deployments, WebJobs, and various other features in Azure Web Sites. It can also run outside of Azure.
Apache License 2.0

stopping_wait_time / 3 min not respected on scale-in #3488

Closed. Bouke closed this issue 5 months ago

Bouke commented 11 months ago

I have a continuous WebJob that runs on multiple instances, and I want to enable automatic scaling for these instances. The instances run long-running tasks, but the number of tasks varies throughout the day. I can come up with an appropriate scaling rule. On scale-in, a (seemingly) random instance is picked to be shut down, even though it might still be working on a long-running task. As per the docs, this is a hard shutdown and the instance has 3 minutes to finish its work:

The App Service Plan's instance count was reduced and some VMs need to be removed and the shutdown time limit is 3 mins.

However, based on my testing, the instance is killed within 30 seconds of the graceful shutdown being signalled, not after 3 minutes.

As an aside, this is far from ideal: I'd rather be able to specify a graceful shutdown duration of a few hours so that I don't have to restart this work on another instance. I mean there's no hurry in shutting the instance down for Microsoft: I'll still be paying for that instance until it is off.

Repro steps.

  1. Create a WebJob in .NET Framework with the following implementation

Program.cs

using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.Azure.WebJobs;

namespace WebJobScalingBehaviour
{
    internal class Program
    {
        public static void Main(string[] args)
        {
            var watch = Stopwatch.StartNew();
            var watcher = new WebJobsShutdownWatcher();

            while (!watcher.Token.IsCancellationRequested)
            {
                Log($"running for {watch.Elapsed} until cancelled, sleeping for 1 minute");
                watcher.Token.WaitHandle.WaitOne(TimeSpan.FromMinutes(1));
            }

            Log($"cancellation was requested at {watch.Elapsed}, how long can we continue running?");

            watch.Restart();

            while (true)
            {
                Log($"still running for {watch.Elapsed} since cancelling, sleeping for 1 second");
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }

            void Log(string message)
            {
                Console.WriteLine($"{Environment.MachineName} {message}");
            }
        }
    }
}

settings.job

{
    "stopping_wait_time": 3600
}
  2. Run it on a B1 with 2 instances.
  3. Wait until the log output shows 2 instances running.
  4. Scale in to 1 instance.
  5. Wait until the shutdown shows up in the logs.
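
To rule out the SDK's watcher as a factor, the shutdown signal could also be observed directly. The following is only a minimal sketch, not something that was part of the test above. It assumes that Kudu signals a graceful shutdown by creating the file whose path is in the WEBJOBS_SHUTDOWN_FILE environment variable, which is my understanding of what WebJobsShutdownWatcher relies on. The class and method names (ShutdownFileProbe, WaitForShutdownFile) are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

namespace WebJobScalingBehaviour
{
    internal static class ShutdownFileProbe
    {
        // Assumption: Kudu exposes the shutdown file path through this variable.
        private const string ShutdownFileVariable = "WEBJOBS_SHUTDOWN_FILE";

        // Polls until the file at 'path' exists or 'timeout' elapses.
        // Returns true if the file appeared, false on timeout.
        public static bool WaitForShutdownFile(string path, TimeSpan pollInterval, TimeSpan timeout)
        {
            var watch = Stopwatch.StartNew();
            while (!File.Exists(path))
            {
                if (watch.Elapsed >= timeout)
                {
                    return false;
                }
                Thread.Sleep(pollInterval);
            }
            return true;
        }

        public static void Main()
        {
            var path = Environment.GetEnvironmentVariable(ShutdownFileVariable);
            if (string.IsNullOrEmpty(path))
            {
                Console.WriteLine($"{ShutdownFileVariable} is not set; probably not running under Kudu");
                return;
            }

            // Wait (effectively without limit) for the shutdown signal.
            WaitForShutdownFile(path, TimeSpan.FromSeconds(1), TimeSpan.MaxValue);

            // Keep logging to see how long the process survives after the signal.
            var sinceSignal = Stopwatch.StartNew();
            while (true)
            {
                Console.WriteLine($"{Environment.MachineName} still running for {sinceSignal.Elapsed} since the shutdown file appeared");
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }
        }
    }
}
```

If this variant is also killed after roughly 30 seconds, that would suggest the limit is enforced by the platform rather than by the SDK.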

Expected behaviour

The job keeps running for the full 3 minutes.

Desired behaviour

The job keeps running for stopping_wait_time that I specify up to a few hours.

Actual behaviour

The job's last output lines show that it was killed less than 30 seconds after cancellation was requested:

2023-10-11T18:42:13  PID[7008] Information dw1sdwk00033X still running for 00:00:26.2276687 since cancelling, sleeping for 1 second
2023-10-11T18:42:14  PID[7008] Information dw1sdwk00033X still running for 00:00:27.2289728 since cancelling, sleeping for 1 second

jvano commented 5 months ago

Hi

If the problem persists and is related to running it on Azure App Service, please open a support incident in Azure: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request

This way we can better track this case and assist you with it.

Thanks,

Joaquin Vano, Azure App Service