Closed by rmunn 2 weeks ago
Turns out the issue is just that Kubernetes defaults to a 1s timeout for its startup health checks, and when fw-headless is first loading, it takes around 1.5 seconds to do everything it needs to set up a database connection and get results. (Probably some EF Core first-time setup going on, besides opening a new DB connection to Postgres.) So we just need to lengthen the k8s timeout, so that a canceled HTTP request from k8s doesn't turn into a TaskCanceledException in C# and cause spurious "health check failed" logs when in fact the DB was just fine.
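For illustration, lengthening the probe timeout might look roughly like the fragment below. This is only a sketch of a container spec, not the project's actual manifest: the port number and the 5-second value are assumptions, and only the `/api/healthz` path comes from this issue.

```yaml
# Hypothetical fragment of the fw-headless container spec.
startupProbe:
  httpGet:
    path: /api/healthz
    port: 80           # assumed; use whatever port fw-headless actually listens on
  timeoutSeconds: 5    # assumed value; Kubernetes defaults this to 1
```

Any value comfortably above the roughly 1.5 seconds of first-request setup time should keep the probe's HTTP request from being canceled.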
Describe the bug
I randomly get warnings like the following in the FwHeadless logs:
Interestingly, although the logs say that the application is shutting down, it is in fact running (or another pod has just started but I'm still seeing the logs from the pod that shut down). At any rate, hitting the `/sync` API with a POST request works correctly, and the logs then show the response to that POST request, so it's unlikely that I'm looking at logs from an older pod.

I notice that the failing LexBoxDbContext check looks like this:
It makes sense for most exceptions from `Users.CountAsync` to count as health check failures. But according to https://stackoverflow.com/questions/60474213/asp-net-core-healthchecks-randomly-fails-with-taskcanceledexception-or-operation, the health check's cancellation token will cause a TaskCanceledException to be thrown if the HTTP request for `/api/healthz` was aborted; this can happen if a second health check runs before the first one has completed. That explains why I usually see this after I start up a fresh FwHeadless container.

I think we should be catching TaskCanceledException and returning `true`
from the LexBoxDbContext health check if the task was canceled.

To Reproduce
Can't reproduce consistently, as it's timing dependent. But I most often see this after running `tilt up` or `task up` for the first time.

Expected behavior
Aborted HTTP requests to `/api/healthz` should not result in an unhealthy status check result: we should catch TaskCanceledException and return `true` or `Healthy` from the health-check method so that ASP.NET doesn't log spurious health check failures.
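
A minimal sketch of the handling proposed here, assuming the check is written as an `IHealthCheck` implementation (the project's real check may be shaped differently). The class name, the `query` delegate, and the description strings are hypothetical; the delegate stands in for a call like `Users.CountAsync` on the LexBoxDbContext.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical wrapper, not the project's actual health check.
public sealed class CancellationTolerantDbHealthCheck : IHealthCheck
{
    // Stand-in for a query such as dbContext.Users.CountAsync(cancellationToken).
    private readonly Func<CancellationToken, Task<int>> _query;

    public CancellationTolerantDbHealthCheck(Func<CancellationToken, Task<int>> query)
    {
        _query = query;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            await _query(cancellationToken);
            return HealthCheckResult.Healthy();
        }
        // TaskCanceledException derives from OperationCanceledException, so this
        // clause covers both. The filter treats cancellation as benign only when
        // the token passed to this check was itself canceled, i.e. the caller
        // gave up on the request.
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            return HealthCheckResult.Healthy("Health check request was canceled by the caller.");
        }
        // Every other exception still counts as a real failure.
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database query failed.", ex);
        }
    }
}
```

Filtering on `cancellationToken.IsCancellationRequested` keeps a cancellation that originates elsewhere, such as a database-side timeout, reported as unhealthy instead of being silently swallowed.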