gitpod-io / gitpod


[registry-facade] ready-probe-labeler timeout exceeds registry-facade /ready response #12220

Closed su-thomas closed 2 years ago

su-thomas commented 2 years ago

Bug description

tl;dr: The ready-probe-labeler has a hardcoded 500ms timeout set on the http.Client that performs the /ready probe check. In my case, the registry-facade's /ready endpoint takes ~800ms to respond. Because of the premature timeout, the ready-probe-labeler never sees the 200 response status code and exits with an error, which sends the pod into CrashLoopBackOff.


I followed the GitPod Single-Cluster Reference Architecture installation guide for AWS.

After deploying GitPod through the kots installer, I noticed the registry-facade pods were in CrashLoopBackOff status.

Looking at the k8s events, I noticed:

...
Warning  Unhealthy  26m                 kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
...

Additionally, the node-labeler container logs showed that it wasn't seeing a 200 response status code from the /ready endpoint:

node-labeler {"level":"info","message":"Starting node labeler...","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"INFO","time":"2022-08-16T10:48:28Z"} 
node-labeler {"level":"fatal","message":"Unexpected error waiting for probe URL: URL http://localhost:8086/ready never returned status code 200","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"CRITICAL","time":"2022-08-16T10:49:31Z"} 

This initially led me to believe there was a problem with the /ready endpoint on the registry-facade container. So to investigate further, I set up an alpine container in the registry-facade daemonset.

I verified the /ready endpoint was actually working fine within the pod:

# curl -i http://localhost:8086/ready
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Fri, 19 Aug 2022 06:13:40 GMT
Content-Length: 3

{}

So, why wasn't the node-labeler getting a 200 response status code?

After looking through the GitPod source code, I came across the function responsible for checking the probe:

func waitForURLToBeReachable(probeURL string, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    tick := 500 * time.Millisecond

    ticker := time.NewTicker(tick)
    defer ticker.Stop()

    client := &http.Client{
        Timeout: tick,
    }

    for {
        select {
        case <-ctx.Done():
            return xerrors.Errorf("URL %v never returned status code 200", probeURL)
        case <-ticker.C:
            resp, err := client.Get(probeURL)
            if err != nil {
                continue
            }
            defer resp.Body.Close()

            if resp.StatusCode != http.StatusOK {
                continue
            }

            return nil
        }
    }
}

Notably, there is a hardcoded timeout of 500ms set for the http.Client.

Adding some additional logging to this function...:

...
        case <-ticker.C:
            resp, err := client.Get(probeURL)
            if err != nil {
                log.Info("err != nil, err: ", err)
                continue
            }
...

...and running it in the pod:

/gitpod/dev/ready-probe-labeler # go run main.go --label=gitpod.io/registry-facade_ready_ns_gitpod --probe-url=http://localhost:8086/ready
{"level":"info","message":"Starting node labeler...","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"INFO","time":"2022-08-19T06:24:48Z"}
{"level":"info","message":"err != nil, err: Get \"http://localhost:8086/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"INFO","time":"2022-08-19T06:24:50Z"}
...
...
...
{"level":"info","message":"err != nil, err: Get \"http://localhost:8086/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"INFO","time":"2022-08-19T06:25:49Z"}
{"level":"fatal","message":"Unexpected error waiting for probe URL: URL http://localhost:8086/ready never returned status code 200","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"CRITICAL","time":"2022-08-19T06:25:49Z"}
exit status 1

...it appears the http.Client is being interrupted before it receives the response headers. This error repeats on every tick until the outer timeout ends the loop, and the program exits with status 1. This error exit is likely what causes the CrashLoopBackOff, given the pod's default restartPolicy(?).
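
To confirm that mechanism in isolation, here is a minimal, self-contained sketch (not Gitpod code; the 800ms sleep simply mimics what I observe in my environment) of a 500ms http.Client probing a handler that takes longer than that to answer:

package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "time"
)

func main() {
    // Stand-in for registry-facade's /ready: it answers 200, but only after ~800ms.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(800 * time.Millisecond)
        w.WriteHeader(http.StatusOK)
    }))
    defer srv.Close()

    // Same per-attempt budget as the ready-probe-labeler.
    client := &http.Client{Timeout: 500 * time.Millisecond}

    for i := 0; i < 3; i++ {
        resp, err := client.Get(srv.URL)
        if err != nil {
            // Every attempt fails with "context deadline exceeded
            // (Client.Timeout exceeded while awaiting headers)".
            fmt.Println("attempt", i+1, "error:", err)
            continue
        }
        resp.Body.Close()
        fmt.Println("attempt", i+1, "status:", resp.StatusCode)
    }
}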

Benchmarking the /ready endpoint with hey, it's clear that it doesn't respond within 500ms (at least in my environment).

# hey -n 300 -m GET http://localhost:8086/ready

Summary:
  Total:    5.0143 secs
  Slowest:  0.9435 secs
  Fastest:  0.6887 secs
  Average:  0.7945 secs
  Requests/sec: 59.8292

  Total data:   900 bytes
  Size/request: 3 bytes

Response time histogram:
  0.689 [1] |
  0.714 [15]    |■■■■■■■
  0.740 [29]    |■■■■■■■■■■■■■
  0.765 [46]    |■■■■■■■■■■■■■■■■■■■■
  0.791 [35]    |■■■■■■■■■■■■■■■
  0.816 [91]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.842 [41]    |■■■■■■■■■■■■■■■■■■
  0.867 [16]    |■■■■■■■
  0.893 [9] |■■■■
  0.918 [4] |■■
  0.944 [13]    |■■■■■■

Latency distribution:
  10% in 0.7262 secs
  25% in 0.7548 secs
  50% in 0.7953 secs
  75% in 0.8191 secs
  90% in 0.8629 secs
  95% in 0.9103 secs
  99% in 0.9402 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0001 secs, 0.6887 secs, 0.9435 secs
  DNS-lookup:   0.0001 secs, 0.0000 secs, 0.0010 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0037 secs
  resp wait:    0.7943 secs, 0.6886 secs, 0.9433 secs
  resp read:    0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 300 responses

If I set the tick to a longer duration such as 1000ms...:

func waitForURLToBeReachable(probeURL string, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    tick := 1000 * time.Millisecond

    ticker := time.NewTicker(tick)
    defer ticker.Stop()

    client := &http.Client{
        Timeout: tick,
    }
...

...then it works successfully:

/gitpod/dev/ready-probe-labeler # go run main.go --label=gitpod.io/registry-facade_ready_ns_gitpod --probe-url=http://localhost:8086/ready
{"level":"info","message":"Starting node labeler...","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"INFO","time":"2022-08-19T06:35:44Z"}
{"fields.time":11.738146254,"level":"info","message":"node label updated","node":"ip-192-168-102-143.ap-southeast-2.compute.internal","serviceContext":{"service":"ready-probe-labeler","version":""},"severity":"INFO","time":"2022-08-19T06:35:56Z"}

As a manual workaround, I removed the problematic node-labeler container from the registry-facade daemonset, and ran the ready-probe-labeler (with the 1000ms tick modification) in my alpine container to apply the node label. After doing this, my GitPod deployment seems to function fine, and the registry-facade pod is working as expected with no CrashLoopBackOff.

Steps to reproduce

Workspace affected

No response

Expected behavior

node-labeler should receive a 200 response status code from the registry-facade /ready probe.

Example repository

No response

Anything else?

This issue raises some questions:

mrsimonemms commented 2 years ago

This is the discussion thread in Discord

aledbf commented 2 years ago

@su-thomas those times are really high. Why 500ms? That is ~10 times the average time to complete the request

/tmp $ ./hey_linux_amd64 -n 300 -m GET http://localhost:8086/ready

Summary:
  Total:    0.4252 secs
  Slowest:  0.2342 secs
  Fastest:  0.0164 secs
  Average:  0.0567 secs
  Requests/sec: 705.4881

  Total data:   900 bytes
  Size/request: 3 bytes

Response time histogram:
  0.016 [1] |
  0.038 [46]    |■■■■■■■■■■■
  0.060 [162]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.082 [63]    |■■■■■■■■■■■■■■■■
  0.104 [9] |■■
  0.125 [12]    |■■■
  0.147 [0] |
  0.169 [5] |■
  0.191 [0] |
  0.212 [1] |
  0.234 [1] |

Latency distribution:
  10% in 0.0337 secs
  25% in 0.0426 secs
  50% in 0.0528 secs
  75% in 0.0613 secs
  90% in 0.0772 secs
  95% in 0.1094 secs
  99% in 0.1674 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0007 secs, 0.0164 secs, 0.2342 secs
  DNS-lookup:   0.0005 secs, 0.0000 secs, 0.0039 secs
  req write:    0.0001 secs, 0.0000 secs, 0.0018 secs
  resp wait:    0.0559 secs, 0.0164 secs, 0.2341 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0022 secs

Status code distribution:
  [200] 300 responses

I suggest you check the CPU utilization (I/O in general) in the node where you see this behavior.

Edit: this is from a gitpod.io production node

aledbf commented 2 years ago

This is the discussion thread in Discord

@MrSimonEmms I get Messages failed to load

mrsimonemms commented 2 years ago

@aledbf I think you'll need to sign up for the self-hosted part of the Discord community to view that

dotwaffle commented 2 years ago

I suggest you check the CPU utilization (I/O in general) in the node where you see this behavior.

I work with @su-thomas and initially built the cluster. It was created with the exact eksctl commands specified in the single-cluster reference implementation, and the container he launches is in the same pod, so the same machine. It would appear that every connection to the readiness probe triggers a lot of checks, and that's part of the problem.

It's worth noting that this is running the exact EC2 instance type recommended, with 2 nodegroups of 2 nodes each, running nothing other than Gitpod, in the ap-southeast-2 (Sydney) region. So it's entirely possible that, if a call goes out via RBAC, AWS is proxying that call to the master back in the US, which might explain the latency.

Either way, a 500ms timeout seems incredibly low for such a check, considering the service takes many seconds to come up in the first place. Even in your tests the timeout is only twice the maximum observed duration!

axonasif commented 2 years ago

This is the discussion thread in Discord

@MrSimonEmms I get Messages failed to load

Hey @aledbf 👋 You can check this loom from the 74-second timestamp 😄

su-thomas commented 2 years ago

@aledbf, would I be correct in understanding that when a GET request is made (by the node-labeler) to http://127.0.0.1:8086/ready (served by the registry-facade), this will call all of the readiness checks defined? (Source code of heptiolabs/healthcheck here, where each check is called by the handler.)

There are 3 readiness checks for the registry-facade here. Interestingly, each one of those checks is given a 1 second timeout (here, and here).

It seems contradictory that the node-labeler (client) is hardcoded to a 500ms timeout for the response, whilst the checks (triggered by the request) on the server side are each given a 1 second timeout. Would I be wrong in saying that the maximum time a valid health check could take is upwards of 3 seconds (assuming this source code runs each check one after the other, without concurrency)? So why is the client (node-labeler) given only 500ms to get a response from registry-facade?
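
As a rough illustration of how that adds up with the heptiolabs/healthcheck handler (the check names and targets below are placeholders I've made up, not the actual registry-facade configuration), three readiness checks with a 1 second budget each, run one after another when /ready is hit, can legitimately take close to 3 seconds in the worst case:

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/heptiolabs/healthcheck"
)

func main() {
    h := healthcheck.NewHandler()

    // Each readiness check gets its own 1s budget (placeholder targets).
    h.AddReadinessCheck("registry", healthcheck.HTTPGetCheck("https://eu.gcr.io/v2/", 1*time.Second))
    h.AddReadinessCheck("dns", healthcheck.DNSResolveCheck("eu.gcr.io", 1*time.Second))
    h.AddReadinessCheck("upstream", healthcheck.TCPDialCheck("localhost:5000", 1*time.Second))

    // The handler runs the checks when /ready is requested, so a response that is
    // still a valid 200 can take close to 3s if each check uses most of its budget.
    log.Fatal(http.ListenAndServe(":8086", h))
}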

Again, is there any reason to have such a short fixed timeout for the node-labeler? Is there anything against letting the node-labeler increase its http.Client timeout on each failed tick? That way, if it fails at 500ms, the next attempt could use (for example) 750ms, then 1000ms, and so on, up to a defined threshold.
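
As a sketch of that idea (just an illustration, not existing Gitpod code), reusing the same imports as the quoted waitForURLToBeReachable:

func waitForURLToBeReachable(probeURL string, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    tick := 500 * time.Millisecond
    ticker := time.NewTicker(tick)
    defer ticker.Stop()

    // Start with the current 500ms budget, but grow it after every failed
    // attempt instead of keeping it fixed.
    clientTimeout := 500 * time.Millisecond
    const maxClientTimeout = 5 * time.Second

    for {
        select {
        case <-ctx.Done():
            return xerrors.Errorf("URL %v never returned status code 200", probeURL)
        case <-ticker.C:
            client := &http.Client{Timeout: clientTimeout}

            resp, err := client.Get(probeURL)
            if err != nil {
                // Give the next attempt a little more time, up to the threshold.
                clientTimeout += 250 * time.Millisecond
                if clientTimeout > maxClientTimeout {
                    clientTimeout = maxClientTimeout
                }
                continue
            }
            resp.Body.Close()

            if resp.StatusCode != http.StatusOK {
                continue
            }

            return nil
        }
    }
}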

It's possible that the health checks for eu.gcr.io are taking longer for our deployment in ap-southeast-2 (Sydney) region.

kylos101 commented 2 years ago

:wave: @su-thomas do you have a workaround, or are you unable to use Gitpod at the moment? I assume unable, but wanted to double check.

aledbf commented 2 years ago

this will call all of the readiness checks defined?

That's correct.

It seems contradictory that the node-labeler (client) is hardcoded to a 500ms timeout for the response, whilst the checks (triggered by the request) on the server side are each given a 1 second timeout

Not really. The node-labeler waitForURLToBeReachable has a default timeout of 60s

It's possible that the health checks for eu.gcr.io are taking longer for our deployment in ap-southeast-2 (Sydney) region.

That could be the issue, but then you need to start increasing the timeout of the registry-facade probes.

We removed the internal timeout, using only one. Please check https://github.com/gitpod-io/gitpod/pull/12272. This change will be included in the next self-hosted release.
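
The general shape of the change is to drop the per-attempt http.Client timeout and let the single outer context bound every request, roughly along these lines (a sketch of the idea, not the exact diff from that PR):

func waitForURLToBeReachable(probeURL string, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()

    // No per-attempt timeout; the outer context is the only deadline.
    client := &http.Client{}

    for {
        select {
        case <-ctx.Done():
            return xerrors.Errorf("URL %v never returned status code 200", probeURL)
        case <-ticker.C:
            req, err := http.NewRequestWithContext(ctx, http.MethodGet, probeURL, nil)
            if err != nil {
                return err
            }

            resp, err := client.Do(req)
            if err != nil {
                continue
            }
            resp.Body.Close()

            if resp.StatusCode != http.StatusOK {
                continue
            }

            return nil
        }
    }
}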

su-thomas commented 2 years ago

👋 @su-thomas do you have a workaround, or are you unable to use Gitpod at the moment? I assume unable, but wanted to double check.

Hi @kylos101, from my initial post:

As a manual workaround, I removed the problematic node-labeler container from the registry-facade daemonset, and ran the ready-probe-labeler (with the 1000ms tick modification) in my alpine container to apply the node label. After doing this, my GitPod deployment seems to function fine, and the registry-facade pod is working as expected with no CrashLoopBackOff.

su-thomas commented 2 years ago

@aledbf

That could be the issue, but then you need to start increasing the timeout of the registry-facade probes.

The registry-facade probes' timeouts were fine. As you can see in my initial post, I could get a 200 response by curling http://localhost:8086/ready, and the entire health check was responding in ~800ms. The registry-facade's health probe timeouts already allowed up to 3 seconds in total for a valid check, because each of the 3 checks had 1 second. So these timeouts were fine as is.

Not really. The node-labeler waitForURLToBeReachable has a default timeout of 60s

Yes, a total duration of 60s, but that's not the issue. The issue is that on each attempt within the 60s duration, the http.Client was only given a 500ms timeout, as per the waitForURLToBeReachable source code quoted in my initial post:

...
    tick := 500 * time.Millisecond
...
    client := &http.Client{
        Timeout: tick,
    }
...

...therefore it didn't matter that the attempts were given a 60s window; each attempt was failing because the http.Client had been limited to 500ms per attempt.

We removed the internal timeout, using only one. Please check https://github.com/gitpod-io/gitpod/pull/12272. This change will be included in the next self-hosted release.

This looks like it should fix this issue, thanks! 😊