sillsdev / languageforge-lexbox

Lexbox, SIL linguistic data hub
MIT License
7 stars 2 forks source link

Spurious warnings about DbContext health checks failing in FwHeadless container #1211

Closed rmunn closed 2 weeks ago

rmunn commented 3 weeks ago

Describe the bug I randomly get warnings like the following in the FwHeadless logs:

fail: Microsoft.EntityFrameworkCore.Database.Connection[20004]
      An error occurred using the connection to database 'lexbox' on server 'tcp://db:5432'.
fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
      Health check LexBoxDbContext with status Unhealthy completed after 1289.0786ms with message 'A task was canceled.'
      System.Threading.Tasks.TaskCanceledException: A task was canceled.
         at Npgsql.TaskTimeoutAndCancellation.ExecuteAsync(Func`2 getTaskFunc, NpgsqlTimeout timeout, CancellationToken cancellationToken)
         ...
         at LexData.LexBoxDbContext.HeathCheck(CancellationToken cancellationToken) in /src/backend/LexData/LexBoxDbContext.cs:line 38
         at Microsoft.Extensions.Diagnostics.HealthChecks.DbContextHealthCheck`1.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken)
info: Microsoft.Hosting.Lifetime[0]
      Application is shutting down...

Interestingly, although the logs say that the application is shutting down, it is in fact running (or another pod has just started but I'm still seeing the logs form the pod that shut down). At any rate, hitting the /sync API with a POST request works correctly, and the logs then show the response to that POST request so it's unlikely that I'm looking at logs from an older pod.

I notice that the failing LexBoxDbContext check looks like this:

    public async Task<bool> HeathCheck(CancellationToken cancellationToken)
    {
        //this will throw if we can't connect which is a valid health check response.
        await Users.CountAsync(cancellationToken);
        return true;
    }

It makes sense for most exceptions from Users.CountAsync to count as health check failures. But according to https://stackoverflow.com/questions/60474213/asp-net-core-healthchecks-randomly-fails-with-taskcanceledexception-or-operation, the health check's cancellation token will throw TaskCanceledException if the HTTP request for /api/healthz was aborted; this can happen if a second health check runs before the first one is completed. Which explains why I usually see this after I start up a fresh FwHeadless container.

I think we should be catching TaskCanceledException and returning true from the LexBoxDbContext health check if the task was canceled.

To Reproduce Can't reproduce consistently, as it's timing dependent. But I most often see this after running tilt up or task up for the first time.

Expected behavior Aborted HTTP requests to /api/healthz should not result in an unhealthy status check result: we should catch TaskCanceledException and return true or Healthy from the health-check method so that ASP.NET doesn't log spurious health check failures.

rmunn commented 2 weeks ago

Turns out the issue is just that Kubernetes defaults to 1s for its startup health checks and when fw-headless is first loading, it takes around 1.5 seconds for it to do everything it needs to do to set up a database connection and get results. (Probably some EF Core first-time setup going on, besides opening a new DB connection to Postgres). So we just need to lengthen the k8s timeout, so that a canceled HTTP request from k8s isn't turning into a TaskCanceledException in C# and causing spurious "health check failed" logs when in fact the DB was just fine.