projectkudu / kudu

Kudu is the engine behind git/hg deployments, WebJobs, and various other features in Azure Web Sites. It can also run outside of Azure.
Apache License 2.0

Provide a way to receive alerts if the webjob host failed to start a job #2789

Closed paulbatum closed 6 months ago

paulbatum commented 6 years ago

From @xt0rted on June 8, 2018 23:16

If you deploy a webjob and it fails to start for any reason, including one outside your control (for example https://github.com/aspnet/websdk/issues/347), there should be a way to get alerted to this. My site has the AI extension installed, and the site and webjobs use AI and Raygun for error monitoring, but none of this picks up issues when the webjob host fails to run the job. I've run into this once before (#1619), and it's incredibly frustrating to find out hours or days later instead of immediately.

I'd love to see these issues surfaced in AI, along with some way to tell a third-party system such as Slack, Raygun, or Bugsnag that there was a problem.

Repro steps

  1. Publish a full framework webjob with a run.cmd that contains dotnet WebJob.exe
  2. Watch it never start up
  3. Wonder how you can get alerted of this type of issue

Expected behavior

At the very least the portal should give me an error alert that the webjob is failing to start

Actual behavior

Nothing happens, everything continues on as normal

Known workarounds

None

Related information

Copied from original issue: Azure/azure-webjobs-sdk#1742

paulbatum commented 6 years ago

@xt0rted I moved this from the webjobs sdk repo because this issue is really about the webjobs functionality that exists in kudu and not the webjobs SDK itself.

Have you experimented with having your webjob emit some type of heartbeat log to App Insights, and then writing an alert against your App Insights instance that fires when the heartbeat is missing?
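A minimal sketch of the heartbeat idea, not taken from this thread. The `Heartbeat` class and its `emit` delegate are hypothetical names used for illustration; in a real job `emit` would wrap something like `TelemetryClient.TrackEvent("WebJobHeartbeat")`, where the event name is also made up. The alert that fires when heartbeats stop arriving is configured separately in App Insights and is not shown.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Heartbeat
{
    // Calls `emit` once immediately, then once per `interval`, until `token` is cancelled.
    // `emit` is a stand-in (assumption) for a telemetry call such as
    // telemetryClient.TrackEvent("WebJobHeartbeat").
    public static async Task RunAsync(Action emit, TimeSpan interval, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            emit();
            try
            {
                await Task.Delay(interval, token);
            }
            catch (OperationCanceledException)
            {
                // Cancellation during the wait ends the loop quietly.
                break;
            }
        }
    }
}
```

The loop would need to be started at the very top of the job's entry point. If the host never launches the process, no heartbeat is ever emitted, and that absence is what the alert keys on.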

davidebbo commented 6 years ago

> At the very least the portal should give me an error alert that the webjob is failing to start

I'm surprised about that part. If it fails to start, the Portal should not be telling you that the WebJob is running.

xt0rted commented 6 years ago

The status of the jobs in the portal was something like "starting up" or "restarting". At first glance it seems like everything is fine because of the way it's worded. If it said "failed to start", "error starting", or "error starting - retrying", that would be much more helpful.

What I was referring to with the portal error is something like the alerts that show in the top right when you log in that say you have xx credits remaining, or when you save your app settings. I've received those a number of times saying API calls were being throttled (I think for deployments).

If App Insights was used inside the webjob host (the process that discovers and runs the jobs, not the JobHost class), then it could log job failures, which would then show in the failures blade/failed requests list. I'm sure this could also be used to set up alerts, but I've yet to figure those out.

davidebbo commented 6 years ago

Yep, "starting up" or "restarting" is what I would expect. Just wanted to confirm it didn't say "running".

But other than that, yes, I agree that the lack of alerting is not ideal. I don't think there is a great solution right now.

OskarKlintrot commented 6 years ago

Since you are using AI you can do something like this in the meantime:

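using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.Azure.WebJobs;

public class Functions
{
    public static async Task MyWebJobAsync([TimerTrigger("0 0 2 * * *", RunOnStartup = true)] TimerInfo timer = null)
    {
        // Each run is tracked as one dependency operation of type "WebJob".
        using (var webJob = new TrackWebJob())
        {
            try
            {
                await Task.CompletedTask; // Placeholder: do the job's actual work here
            }
            catch (Exception e)
            {
                webJob.Failed(e);
                // Add "throw;" here if the WebJobs dashboard should also show the run as failed.
            }
        }
    }
}

public class TrackWebJob : IDisposable
{
    // Shared across all instances so the client is reused rather than recreated per run.
    private static readonly TelemetryClient telemetry = new TelemetryClient();
    private readonly IOperationHolder<DependencyTelemetry> operation;

    // [CallerMemberName] fills in the name of the calling method, e.g. "MyWebJobAsync".
    public TrackWebJob([CallerMemberName] string name = null)
    {
        var dependencyTelemetry = new DependencyTelemetry
        {
            Name = name,
            Type = "WebJob",
        };

        operation = telemetry.StartOperation(dependencyTelemetry);
    }

    public void Failed(Exception e)
    {
        // Mark the operation as failed so alerts on failed dependencies pick it up.
        operation.Telemetry.Success = false;
        telemetry.TrackException(e);
    }

    public void Dispose()
    {
        // Disposing the holder stops the operation and sends the telemetry.
        operation.Dispose();
    }
}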

Now you can set alerts for failed calls for dependency type WebJob.

Disclaimer: I tried to simplify what I actually use in production so it's not tested and might need some extra work but it's enough to show what I'm aiming for. Also note that the dashboard will show all runs as success unless you throw the exception again.

richardbartley commented 4 years ago

Is there an Azure REST API call that will let us query the status of the webjob? There are some APIs listed at https://docs.microsoft.com/en-us/rest/api/appservice/webapps/listwebjobs but which one would return the status?

Maybe this https://docs.microsoft.com/en-us/rest/api/appservice/webapps/listwebjobs#code-try-0
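As a rough sketch of what polling for status could look like, the snippet below reads a `status` field from a job description and flags anything other than "Running". It assumes the Kudu WebJobs REST endpoint `/api/continuouswebjobs/{name}` on the site's scm host, Basic authentication with deployment credentials, and status strings such as "Running" or "PendingRestart"; none of this is confirmed in this thread, so check the current API documentation before relying on it. `WebJobStatusProbe` and its members are hypothetical names.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class WebJobStatusProbe
{
    // Reads the "status" property from a JSON object describing a job.
    // Returns null when the property is absent.
    public static string ParseStatus(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.TryGetProperty("status", out JsonElement status)
                ? status.GetString()
                : null;
        }
    }

    // Treats anything other than "Running" (assumed status value) as worth alerting on.
    public static bool NeedsAttention(string status) =>
        !string.Equals(status, "Running", StringComparison.OrdinalIgnoreCase);

    // Fetches the status of one continuous job.
    // scmBaseUrl is assumed to look like https://<site>.scm.azurewebsites.net
    public static async Task<string> GetStatusAsync(
        HttpClient http, string scmBaseUrl, string jobName, string user, string password)
    {
        var url = $"{scmBaseUrl}/api/continuouswebjobs/{Uri.EscapeDataString(jobName)}";
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        string credentials = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{user}:{password}"));
        request.Headers.Authorization = new AuthenticationHeaderValue("Basic", credentials);

        using (HttpResponseMessage response = await http.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return ParseStatus(await response.Content.ReadAsStringAsync());
        }
    }
}
```

A scheduled check running outside the site could call `GetStatusAsync` and forward the result to Slack or a similar system whenever `NeedsAttention` returns true.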

jvano commented 6 months ago

Hi

If the problem persists and is related to running it on Azure App Service, please open a support incident in Azure: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request

This way we can better track and assist you on this case.

Thanks,

Joaquin Vano, Azure App Service