github / glb-director

GitHub Load Balancer Director and supporting tooling.
Other
2.37k stars 227 forks source link

Add a timeout to http healthchecker connections #124

Closed pavantc closed 2 years ago

pavantc commented 3 years ago

The healthchecker is the source of truth for the health status of the glb-proxy nodes to which traffic is directed from the director. The healthchecker is configured to perform HTTP healthchecks on the glb-proxies via http.Get() of the configured URI. The connection setup for this http.Get() however does not configure a timeout. As a result, a bad proxy can, over time (a matter of minutes) cause a large number of open connections in the healthchecker leading to it hitting the MAX open files limit and subsequently failing connecting to any glb-proxy and thus marking every proxy as failed.

See related threads of discussion here: https://github.slack.com/archives/CGNJD88A2/p1627908515043000 https://github.slack.com/archives/CGNJD88A2/p1627909019050400

/cc @shelson @awlx @theojulienne

pavantc commented 3 years ago

Local tests:

$ script/test
Building...
2021-08-03T06:40:48Z
Beginning tests
=== GO TEST ===
=== RUN   TestDampenedResultStreamNoDampening
--- PASS: TestDampenedResultStreamNoDampening (0.00s)
=== RUN   TestDampenedResultStreamImmediateHealthySlowFailure
--- PASS: TestDampenedResultStreamImmediateHealthySlowFailure (0.00s)
=== RUN   TestDampenedResultStreamImmediateFailureSlowHealth
--- PASS: TestDampenedResultStreamImmediateFailureSlowHealth (0.00s)
=== RUN   TestDampenedResultStreamSlowFailureSlowHealth
--- PASS: TestDampenedResultStreamSlowFailureSlowHealth (0.00s)
PASS
ok      github.com/github/glb-director/src/glb-healthcheck      0.752s
=== SHELL TEST ===
test: service starts up, exposes debug vars ...                    OK (0s)
test: outputs the healthcheck file with valid health ...           OK (3s)
test: reload should take effect ...                                OK (6s)
test: responds to health check changes ...                         OK (30s)
===
2021-08-03T06:41:29Z
Done.
pavantc commented 3 years ago

Article explains a common pitfall with http.Get(): https://medium.com/@nate510/don-t-use-go-s-default-http-client-4804cb19f779

pavantc commented 3 years ago

Splunk query showing all failed hosts logged

Screen Shot 2021-08-03 at 5 37 24 AM
shelson commented 3 years ago

Is it possible to unit-test this code so we know it does what we think it will do?

EriGWorld commented 3 years ago

It is nice to patch the HTTP checker to behave though i would argue the problem is with the manager itself leaking resources. That should be the RCA target IMO. The manager job should be to track/report state and to manage/stop these tests if they have not returned in the time-slot allocated.

pavantc commented 3 years ago

It is nice to patch the HTTP checker to behave though i would argue the problem is with the manager itself leaking resources. That should be the RCA target IMO. The manager job should be to track/report state and to manage/stop these tests if they have not returned in the time-slot allocated.

The resource here is a socket created implicitly by the use of the http library's http.Get() routine that's specific to the HttpHealthChecker.go module. And the fix is also sort of a "indirect" resource management. I mean, there is no explicit create/open/close of resources to come up with a resource management scheme in the manager.

pavantc commented 3 years ago

Is it possible to unit-test this code so we know it does what we think it will do?

Very relevant @shelson. Looking for a simple way to do this verification..

pavantc commented 3 years ago

@shelson would this do:

I added the filecounts of the glb-healthcheck process after starting the tests with the following code changes to the test script:

bash-3.2$ git diff test/
diff --git a/src/glb-healthcheck/test/test-basic.sh b/src/glb-healthcheck/test/test-basic.sh
index 1e12cf0..b41bba1 100644
--- a/src/glb-healthcheck/test/test-basic.sh
+++ b/src/glb-healthcheck/test/test-basic.sh
@@ -163,6 +163,7 @@ begin_test "responds to health check changes"
 (
   setup

+  echo "FILE COUNT BEFORE TEST: $(lsof -p `pgrep glb-healthcheck` | wc -l)"
   sleep 3

   # local backend is now marked unhealthy, port 8765 not responding
@@ -190,5 +191,8 @@ begin_test "responds to health check changes"
   # wait for total >3 HC rounds and also over 10s hold period, should then trigger unhealthy
   sleep 8
   [[ "$(jq -r '.tables[1].backends[3].healthy' $TEMPDIR/forwarding_table.hc.json)" == "false" ]]
+
+  sleep 3
+  echo "FILE COUNT AFTER TEST: $(lsof -p `pgrep glb-healthcheck` | wc -l)"
 )
 end_test

Results with fix reverted:

bash-3.2$ git diff HttpHealthChecker.go && KEEPTRASH=1 ./script/test && grep "FILE COUNT" /tmp/glbhc-*/out && rm -rf /tmp/glbhc-*
diff --git a/src/glb-healthcheck/HttpHealthChecker.go b/src/glb-healthcheck/HttpHealthChecker.go
index 929e7db..73bcdae 100644
--- a/src/glb-healthcheck/HttpHealthChecker.go
+++ b/src/glb-healthcheck/HttpHealthChecker.go
@@ -55,10 +55,10 @@ func httpCheckURL(url string, timeoutSec time.Duration) HealthResultStream {

        go func() {
                httpCounters.Add("Checks", 1)
-               var httpClient = &http.Client {
-                       Timeout: timeoutSec,
-               }
-               resp, err := httpClient.Get(url)
+               //var httpClient = &http.Client {
+               //      Timeout: timeoutSec,
+               //}
+               resp, err := http.Get(url)
                if err != nil {
                        ch <- HealthResult{Healthy: false, Failure: err.Error()}
                        close(ch)
Building...
2021-08-03T11:38:52Z
Beginning tests
=== GO TEST ===
=== RUN   TestDampenedResultStreamNoDampening
--- PASS: TestDampenedResultStreamNoDampening (0.00s)
=== RUN   TestDampenedResultStreamImmediateHealthySlowFailure
--- PASS: TestDampenedResultStreamImmediateHealthySlowFailure (0.00s)
=== RUN   TestDampenedResultStreamImmediateFailureSlowHealth
--- PASS: TestDampenedResultStreamImmediateFailureSlowHealth (0.00s)
=== RUN   TestDampenedResultStreamSlowFailureSlowHealth
--- PASS: TestDampenedResultStreamSlowFailureSlowHealth (0.00s)
PASS
ok      github.com/github/glb-director/src/glb-healthcheck      0.423s
=== SHELL TEST ===
test: service starts up, exposes debug vars ...                    OK (1s)
test: outputs the healthcheck file with valid health ...           OK (3s)
test: reload should take effect ...                                OK (7s)
test: responds to health check changes ...                         OK (32s)
===
2021-08-03T11:39:36Z
Done.
FILE COUNT BEFORE TEST:       21
FILE COUNT AFTER TEST:       55

Results with fix:

bash-3.2$ git diff HttpHealthChecker.go && KEEPTRASH=1 ./script/test && grep "FILE COUNT" /tmp/glbhc-*/out && rm -rf /tmp/glbhc-*
Building...
2021-08-03T11:36:16Z
Beginning tests
=== GO TEST ===
=== RUN   TestDampenedResultStreamNoDampening
--- PASS: TestDampenedResultStreamNoDampening (0.00s)
=== RUN   TestDampenedResultStreamImmediateHealthySlowFailure
--- PASS: TestDampenedResultStreamImmediateHealthySlowFailure (0.00s)
=== RUN   TestDampenedResultStreamImmediateFailureSlowHealth
--- PASS: TestDampenedResultStreamImmediateFailureSlowHealth (0.00s)
=== RUN   TestDampenedResultStreamSlowFailureSlowHealth
--- PASS: TestDampenedResultStreamSlowFailureSlowHealth (0.00s)
PASS
ok      github.com/github/glb-director/src/glb-healthcheck      0.763s
=== SHELL TEST ===
test: service starts up, exposes debug vars ...                    OK (0s)
test: outputs the healthcheck file with valid health ...           OK (3s)
test: reload should take effect ...                                OK (6s)
test: responds to health check changes ...                         OK (33s)
===
2021-08-03T11:37:00Z
Done.
FILE COUNT BEFORE TEST:       21
FILE COUNT AFTER TEST:       21
EriGWorld commented 3 years ago

It is nice to patch the HTTP checker to behave though i would argue the problem is with the manager itself leaking resources. That should be the RCA target IMO. The manager job should be to track/report state and to manage/stop these tests if they have not returned in the time-slot allocated.

The resource here is a socket created implicitly by the use of the http library's http.Get() routine that's specific to the HttpHealthChecker.go module. And the fix is also sort of a "indirect" resource management. I mean, there is no explicit create/open/close of resources to come up with a resource management scheme in the manager.

You can poll on the http resource context but requires more restructuring. LGTM for now!

How do you feel the other modules i.e. TCP is handling this scenario?