sillsdev / languageforge-lexbox

Lexbox, SIL linguistic data hub
MIT License

Avoid failed-too-soon health checks on startup #1212

Closed rmunn closed 2 weeks ago

rmunn commented 2 weeks ago

Fix #1211.

If the HTTP call to /api/healthz is aborted because Kubernetes expects it to take 1 second or less (the default timeout for health checks in Kubernetes is 1 second unless you set a higher value), we currently treat the resulting TaskCanceledException as a DB connection failure and therefore an unhealthy pod. But on first startup, the fw-headless container takes longer to respond to the first health check (I measured 1.5 seconds on my dev laptop) because it needs to open a new database connection, do whatever first-time setup EF Core requires, and so on. So we'll lengthen the Kubernetes timeout to 5 seconds so that ASP.NET doesn't get a TaskCanceledException and think it's in an unhealthy state when everything is fine.

github-actions[bot] commented 2 weeks ago

C# Unit Tests

75 tests    75 :white_check_mark:   5s :stopwatch:
13 suites    0 :zzz:
 1 files     0 :x:

Results for commit 5b08ba64.

hahn-kev commented 2 weeks ago

If I remember correctly, the token is cancelled when the health check times out (default 30s), which in turn cancels the request. If the simple query can't finish in 30s, then it should be considered failed.

rmunn commented 2 weeks ago

Turns out the TaskCanceledException can be thrown when a health check is considered timed out, and we do want it to count as a failure. But Kubernetes has a default of 1 second, while the FwHeadless container often needs longer to connect to the DB on first startup (I see about 1.5 to 1.6 seconds routinely). Upping the k8s timeout value (as done in commit 8bbdac90) should fix this, with no code changes needed.
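
For readers unfamiliar with the setting in question, here is a rough sketch of what a longer probe timeout can look like in a Kubernetes container spec. It is not the contents of commit 8bbdac90: the probe type, port, and image name are placeholders assumed for illustration. Only the /api/healthz path, the fw-headless container name, and the 5-second value come from this thread.

```yaml
# Hypothetical excerpt from a Deployment's pod template (illustration only).
containers:
  - name: fw-headless
    image: example/fw-headless:latest  # placeholder image name (assumption)
    readinessProbe:                    # assumption: the thread does not say which probe type is used
      httpGet:
        path: /api/healthz             # health endpoint named in this PR
        port: 80                       # placeholder port (assumption)
      timeoutSeconds: 5                # Kubernetes defaults to 1 second when this is omitted
```

With `timeoutSeconds` raised, a first health check that takes about 1.5 seconds completes normally instead of being aborted by the kubelet.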